Figure 1.3-1: 8-Server Chassis Example

1.4.2 MAC Features 

• PCI Express v1.0a compliant

• Can be configured to operate as a Shared Port or a Non-Shared Port

• Layered architecture with Configuration Block, Presentation Layer, Transaction
Layer, Data Link Layer, and Logical Physical Layer

• Supports x1, x2, x4, and x8 links

• Supports up to 16 OS Domains and up to 8 Virtual Channels with a maximum of
16 different OSD/VC combinations

• Provides a 64-bit data path at 250MHz 

• Supports configurable maximum packet sizes in the range of 128B to 1KB 

• Performs PHY level link negotiations 
• The MAC is divided into 8 sub-modules, where each sub-module consists of 8
physical lanes that can be negotiated to operate as 1 x8 link, or 2 links of x1, x2,
or x4 lane widths

• 64KB of Receive Data Buffer and 6144 Bytes of Header Buffer (256 outstanding
transactions per port)

• Supports cannibalization of Receive Data Buffers and Receive Header Buffers.
An x8 link can use all 128KB of Receive Data Buffer and all 512 Header Credits.

• 8-bit Single Error Correction and Double Error Detection (SECDED) on Rx Data
& Header Buffers

• Maintains Transaction Ordering within each egress port per OSD/VC

• Add Power States supported 

• Add latency statements (mention 'fast path' feature) 

1.4.3 Switch Core Features 

The switch core is responsible for forwarding transaction packet data from the 16 RX
MAC interfaces over to the 16 TX MAC interfaces. It will be implemented as a data
crossbar and a transaction scheduler. The high level functions of the switch core are:

• Implements a transaction scheduler that allows for programmable fairness at the
different arbitration levels (per input port, per input port per VC) that transfers
one transaction per clock from a source to a destination

• Provides a virtual output queue (VOQ) structure so that each input feeds its own virtual
switch from a software configuration viewpoint

• Implements a data crossbar that efficiently moves the packet data from the 16 RX
MACs over to the 16 TX MACs

• Provides an interface to the switch management logic for transactions that terminate
in our device

• Lookup interface for PCI bridge routing tables (support for address routing, ID-based
bus routing, and implicit routing)





Ndxi^^a^ro Architecture 
2.1 Detailed Block Diagram

2.1.1 Chip Level Block Diagram 

2.1.2 Block Descriptions 

2.2 System View of Chip 

2.2.1 PCI-Bus enumeration - Prashantha 

2.2.2 Transaction Routing 

According to the PCI Express Base Spec (1.0a), there are three ways to route a TLP:
address, ID, or implicit. The type of routing used is dependent on the Type field in the
header (and the routing sub-field, r[2:0], of the Type field for message transactions).
Each routing type will be covered in the next few sections.

In PCI Express, each switch is logically a set of virtual P2P bridges connected as shown
below.



Figure 2.2-1: (diagram not recoverable; surviving labels: Virtual P2P bridges, Virtual PCI Bus 4,
Virtual PCI Bus 6, Endpoints, PCI-Express/PCI bridge)
This topology is a simplified version of our switch since it only shows 3 ports instead of
16, but it illustrates the idea by having 4 virtual P2P bridges, and 5 PCI buses, numbered 
2 through 6. This topology will be used to show examples of how our routing will work 
and also shows where our on-chip device, the Common On-chip Processor (COP), will be 
located in the topology once discovery is completed by the root complex. 



2.2.3 P2P bridges 

A P2P bridge forwards memory and I/O transfers from one PCI bus over to the other side
of the bridge where another PCI bus resides. Each P2P bridge has a 256-byte header register
space that is accessible by the system and is used to set up all the parameters for the
bridge. The following diagram shows the bridge header (also called a Type 1 header)
with the fields highlighted that apply to transaction routing.



Offset  Fields (most significant byte on the left)
0x00    Device ID | Vendor ID
0x04    Status | Command
0x08    Class Code | Revision ID
0x0C    BIST | Header Type | Primary Latency Timer | Cache Line Size
0x10    Base Address Register 0
0x14    Base Address Register 1
0x18    Secondary Latency Timer | Subordinate Bus Number | Secondary Bus Number | Primary Bus Number
0x1C    Secondary Status | I/O Limit | I/O Base
0x20    Memory Limit | Memory Base
0x24    Prefetchable Memory Limit | Prefetchable Memory Base
0x28    Prefetchable Base Upper 32 Bits
0x2C    Prefetchable Limit Upper 32 Bits
0x30    I/O Limit Upper 16 Bits | I/O Base Upper 16 Bits
0x34    Reserved | Capability Pointer
0x38    Expansion ROM Base Address
0x3C    Bridge Control | Interrupt Pin | Interrupt Line

Figure 2.2-2:

All of these registers are documented in the P2P Bridge spec (pp. 26-54), with some
changes made in PCI Express (base spec 1.0a, section 7.5.3, pp. 327-330). The addressing
starts at the top right of the table as register address 0x0 (lower 8 bits of Vendor ID) and
continues to the bottom left as register 0x3F (upper 8 bits of Bridge Control).



2.2.4 Address routing 

Address routing is used for memory (32-bit and 64-bit) and I/O transactions that must 
pass through our switch. Each will be discussed in detail in the next few sections. 

The header fields important to address routing are shown in the next two diagrams. 

Figure 2.2-3: Header format for 64-bit Transactions (prefetchable memory)

Figure 2.2-4: Header format for 32-bit Transactions (memory and I/O)



2.2.4.1.1 Memory transactions

In P2P bridges, there are 2 different address ranges that can be defined, the 32-bit
memory range (which is required) and the 64-bit memory range (which is optional). Our
switch will implement both ranges. The bridge header registers for both memory
transaction types will need to be set up by the system. The memory TLP in PCI Express
does not specify which of the two ranges a transaction will fit into (the SAC and DAC
address cycles don't exist as they do in PCI), so both the memory range and the
prefetchable memory range must be compared against for a 32-bit transaction since the
prefetchable range can be below 4 GB. A memory transaction is defined as follows:



TLP Type   Fmt[1:0]   Type[4:0]   Description
MRd        00, 01     0 0000      Memory read
MRdLk      00, 01     0 0001      Memory read-locked
MWr        10, 11     0 0000      Memory write

Table 2.2-1:

The LSB of the Fmt[1:0] field of each of the three memory request types is high if a
64-bit address is present and low for a 32-bit address.

Each memory request (for decode purposes of the bridge) is addressing 1 MB of memory,
so the lower 20 bits are assumed to be zero to match the memory space to the base/limit
range. Note that any memory transaction that is addressing less than 4 GB is always a
32-bit transaction. Only transactions above 4 GB may use the 64-bit address mode TLP; these
are defined as prefetchable transactions that use the prefetchable base/limit registers.
Following is a diagram that shows one way the system could provision the two memory
address ranges.



Figure 2.2-5: (diagram not recoverable; surviving labels: Primary bus, Secondary bus, 4GB Boundary)

In th[...] address 0x40_0000 going from north to south would still get through.
Also, the address 0x[...]0_0000_0000 would go through from north to south. The address
0xFFFF_FFFF_FFFF_FFFF from south to north would also pass across the bridge in this
configuration.

The bottom line is that for 32-bit memory transactions, the standard memory range, the
prefetchable range, and the memory-mapped BAR space must all be examined to process a
transaction. For 64-bit transactions, only the prefetchable range must be examined.
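
The window check described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names and the (base, limit) tuples are hypothetical, the memory-mapped BAR comparison is left out, and the real logic is hardware rather than software.

```python
# Hypothetical sketch of the bridge memory-window decode (not the actual RTL).
# Windows are decoded at 1 MB granularity, so the low 20 address bits are ignored.

def in_window(addr, base, limit):
    """True if addr falls inside [base, limit] when compared in 1 MB units."""
    return (base >> 20) <= (addr >> 20) <= (limit >> 20)

def memory_hit(addr, is_64bit, mem, pref):
    """mem and pref are assumed (base, limit) tuples for one P2P bridge.

    A 64-bit TLP is compared against the prefetchable window only.
    A 32-bit TLP is compared against both windows, because the
    prefetchable window may be placed below 4 GB.
    """
    if is_64bit:
        return in_window(addr, *pref)
    return in_window(addr, *mem) or in_window(addr, *pref)

# Example provisioning (values are made up for illustration).
mem_window = (0x4000_0000, 0x4FFF_FFFF)
pref_window = (0x1_0000_0000, 0x1_FFFF_FFFF)
```

With this provisioning, `memory_hit(0x4000_1234, False, mem_window, pref_window)` matches the 32-bit window, while the same address presented as a 64-bit TLP does not match.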

2.2.4.1.2 I/O transactions 

I/O transactions are limited to a 32-bit space, with a base and a limit being specified 
within the P2P bridge header just like memory transactions. The following fields are 
used to define the I/O TLP: 
TLP Type   Fmt[1:0]   Type[4:0]   Description
IORd       00         0 0010      I/O read request
IOWr       10         0 0010      I/O write request

Table 2.2-2:



This type of access is 4KB-aligned (lower 12 bits are assumed to be zero for matching the
range), so only the upper 20 bits of an I/O transfer are compared for a match in the
address range. For transactions traveling downstream, the I/O address must match within
the I/O range, and the transaction will be forwarded downstream. For transactions
traveling upstream, the I/O address must fall outside the I/O range for the transaction
to be forwarded upstream.

Figure 2.2-6: (diagram not recoverable; surviving label: 0x0)

For example, using this configuration, a transaction for address 0x40_0000 is presented on the
bridge's primary bus interface. Since it matches within the range programmed in the I/O
base/limit registers, the transaction is forwarded downstream to the bridge's secondary
bus. Next, a transaction for address 0x0 is presented on the bridge's secondary bus.
Since this is NOT within the base/limit range, this transaction is forwarded up to the
bridge's primary bus. In order to disable this range, the limit must be programmed to a
smaller number than the base. This means all downstream transactions will be rejected,
and all upstream transactions will be allowed through. In the event of an error (an
address of 0x10_0000 is presented at the primary bus for example), the transaction is
discarded and an Unsupported Request (UR) message must be sent back out the port.
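
The I/O forwarding rules above can be sketched as a small decision function. The base and limit values below are assumptions chosen so that the three addresses in the example behave as described; the figure that held the real values did not survive. The function and result names are hypothetical.

```python
# Hypothetical sketch of the I/O base/limit decode for one P2P bridge.
# I/O windows are decoded at 4 KB granularity (low 12 bits ignored).

def io_in_range(addr, io_base, io_limit):
    """Compare only the upper 20 bits of a 32-bit I/O address."""
    return (io_base >> 12) <= (addr >> 12) <= (io_limit >> 12)

def route_io(addr, from_primary, io_base, io_limit):
    """Return a string naming the action taken for one I/O transaction."""
    inside = io_in_range(addr, io_base, io_limit)
    if from_primary:
        # Downstream traffic must hit the window, otherwise it is an error.
        return "forward_downstream" if inside else "unsupported_request"
    # Upstream traffic crosses the bridge only when it misses the window.
    return "not_forwarded" if inside else "forward_upstream"

# Assumed window for illustration only.
IO_BASE, IO_LIMIT = 0x40_0000, 0x7F_FFFF
```

Programming the limit below the base makes `io_in_range` false for every address, which reproduces the disabled behavior: every downstream transaction is rejected and every upstream transaction is passed.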



2.2.4.1.3 Peer to Peer support

Upstream traffic in a bridge is only allowed to cross the bridge if it's outside the
{base,limit} range as described above in the examples. This would mean that if a
downstream port initiates a read or write request, the address would need to fall outside
the ranges of that port's P2P bridge.

Assuming it's outside the range, the address is then compared with the root's address
range and the other endpoint's address ranges. For the root's range, the address must
again be outside the {base,limit} tuple. If it is, the transaction is forwarded up to the root.
If the address is within the {base,limit} range, the other endpoint's ranges are checked for
downstream matches. If a match is found, the transaction is forwarded to the other
endpoint. If no match is found, the bridge issues a UR packet back to the endpoint.

The same mechanism will be used for downstream peer-to-peer support as well. The
Root will be compared first since it is the likely recipient, but if no match is found the
other downstream ports will be checked to see if the transaction should be routed there
instead.
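
The lookup order described above (own range, then the root's range, then the other downstream ports) can be sketched as follows. All names, the tuple-based return value, and the example windows are hypothetical; the sketch ignores the 1 MB and 4 KB granularity for brevity.

```python
# Hypothetical sketch of the peer-to-peer lookup order for a request that
# enters the switch on a downstream port.

def inside(addr, window):
    base, limit = window
    return base <= addr <= limit

def route_upstream_request(addr, src_port, root_port, root_window, port_windows):
    """Return (action, port) for a request initiated on src_port.

    root_window is the assumed {base,limit} of the root's bridge and
    port_windows maps each downstream port to its {base,limit} tuple.
    """
    if inside(addr, port_windows[src_port]):
        return ("not_forwarded", None)       # stays below the source bridge
    if not inside(addr, root_window):
        return ("forward", root_port)        # outside the root's range: goes up
    for port, window in port_windows.items():
        if port != src_port and inside(addr, window):
            return ("forward", port)         # peer-to-peer hit
    return ("unsupported_request", None)     # no match anywhere: UR

# Assumed provisioning for illustration only.
ROOT_WINDOW = (0x1000_0000, 0x3FFF_FFFF)
PORT_WINDOWS = {1: (0x1000_0000, 0x1FFF_FFFF), 2: (0x2000_0000, 0x2FFF_FFFF)}
```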




So, our architecture will be able to support IPC (inter-processor communication) and
downstream peer-to-peer if the EEPROM provisions our switch such that two roots can
appear on the same hierarchy. Our address lookup module logic does not preclude this
from happening. We will have a bit defined per base/limit pair that allows for a very
flexible mapping. This bit will define each port as either upstream or downstream for
each OSD. This allows the address routing logic to map addresses into the
ranges and allows for upstream peer-to-peer and downstream peer-to-peer in the same
architecture. It is configurable by the EEPROM as to which ports "appear" on each
OSD during device discovery. Note that only the trusted software entity (either a driver
running on an OSD or the I2C interface software) is allowed to modify the routing bits,
so OSDs will not be able to make other OSDs "appear" on their PCI hierarchy by
accident. Any peer to peer configuration is application specific on how it handles the
address [...]

2.2.5 ID routing 

ID routing is used for configuration request TLPs, completion TLPs, and optionally 
vendor-defined messages. The TLP header shown is only 12 bytes, but some ID routed 
packets can have a 16 byte header depending on the TLP type.

Figure 2.2-7: Header format for ID-based Transactions (diagram not recoverable; surviving field
labels: R, TC, TD, EP, Attr, Length, Function Number; the remaining bytes depend on TLP type)



For this type of transaction, the bus_number[7:0] field is used to determine which port a
TLP needs to be sent to. If the bus_number field indicates one of our switch's VP2P
bridge headers is being addressed, the COP will use the device number field to determine
which header is being addressed and take the appropriate action.

The header fields are shown in the table below for these transactions.




TLP Type   Fmt[1:0]   Type[4:0]   Description
CfgRd0     00         0 0100      Configuration Read Type 0
CfgWr0     10         0 0100      Configuration Write Type 0
CfgRd1     00         0 0101      Configuration Read Type 1
CfgWr1     10         0 0101      Configuration Write Type 1
Msg        01         1 0010      Message Request - Routed by ID
MsgD       11         1 0010      Message Request with data payload - routed by ID
Cpl        00         0 1010      Completion Without Data - Used for I/O and Configuration
                                  Write Completions and Read Completions (I/O, Configuration,
                                  or Memory) with Completion Status other than Successful
                                  Completion
CplD       10         0 1010      Completion with Data - Used for Memory, I/O, and
                                  Configuration Read Completions.
CplLk      00         0 1011      Completion for Locked Memory Read without Data - Used only
                                  in error case.
CplDLk     10         0 1011      Completion for Locked Memory Read - otherwise like CplD.

Table 2.2-3:



Note that configuration requests will always be sent to the COP for processing. The
function number fields shouldn't need to be used by the switch since we won't support
any multi-function devices.

For messages and completions, the bus number is compared against the programmed
secondary and subordinate bus numbers for the port's P2P bridge header. An example is
shown below from a software topology point of view.
Figure 2.2-8: (diagram not recoverable; it showed the Root Complex, the virtual P2P bridges with
their primary, secondary, and subordinate bus numbers, Buses 2 through 6, Port 3, the endpoints,
and the PCI-Express/PCI bridge; surviving bus number values: 0x2, 0x3, 0x4, 0x5, 0x6, 0x7)

Example 1

A configuration packet arrives on port 1.

• If the packet is a type 0 configuration packet, the Address_lookup_module
immediately returns port 16 as the destination since this packet will be handled by the
COP.

• If the packet is a type 1 configuration packet, the Bus Number field of the header is
compared with Port 1's P2P bridge secondary and subordinate bus numbers. If (bus
number < 3) or (bus number > 5), the transaction is dropped and a UR transaction is
signaled back to the requester since it isn't in the range of the
{secondary,subordinate} tuple.

Else if the Bus Number field is equal to 3 (the secondary bus number of the P2P
bridge header), the packet is addressing one of the downstream virtual P2P bridges
inside the switch. Again, the Address_lookup_module returns port 16, and the packet
is sent to the COP.

Else, the {secondary,subordinate} tuples of all 16 destination ports are compared to
see which port the packet should be routed to (i.e., secondary_bus[port 3] <=
packet_bus_number <= subordinate_bus[port 3], etc.). Once a match is found, the
port is returned. Note this could still return port 16 which would mean the packet is
addressing the COP's type 0 header space.
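
The decision sequence in Example 1 can be sketched as below. The function name, the dictionary of per-port bus ranges, and the choice to return None for a UR are assumptions made for illustration; what happens when no destination port matches in the final step is not stated in the text, so the sketch treats it as a UR.

```python
# Hypothetical sketch of the configuration-request lookup from Example 1.

COP_PORT = 16  # ports 0 through 15 are data ports, port 16 is the COP

def route_config(cfg_type, bus, ingress_secondary, ingress_subordinate, port_bus_ranges):
    """Return the destination port, or None when a UR must be signaled."""
    if cfg_type == 0:
        return COP_PORT                      # type 0 is always handled by the COP
    if bus < ingress_secondary or bus > ingress_subordinate:
        return None                          # outside {secondary,subordinate}: UR
    if bus == ingress_secondary:
        return COP_PORT                      # targets a virtual P2P bridge header
    for port, (secondary, subordinate) in port_bus_ranges.items():
        if secondary <= bus <= subordinate:
            return port
    return None                              # assumption: no match is also a UR

# Assumed bus ranges for two downstream ports (illustration only).
PORT_BUS_RANGES = {2: (4, 4), 3: (5, 5)}
```

For instance, `route_config(1, 5, 3, 5, PORT_BUS_RANGES)` selects port 3, while a type 1 request for bus 3 is returned to the COP.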






2.2.6 Implicit routing 

Message transactions are the only type that can use implicit routing, and the port logic
should always examine the r[2:0] sub-field for message packets to see how to handle
them. The following table shows the decoding of the various values.

r[2:0]     Definition
000        Routed to Root Complex
001        Routed by Address
010        Routed by ID
011        Broadcast from Root Complex
100        Local - Terminate at Receiver
101        Gathered and Routed to Root Complex (see section 5.3.3.2.1 of Base Spec)
110-111    Reserved - Terminate at Receiver

Table 2.2-4:

If a message transaction's value of r[2:0] is either 001 (address routing) or 010 (ID
routing), the port logic will follow the same sequence of events listed above in sections
2.2.4 and 2.2.5. If the value is one of the other values, the route is "implicit" since the
definition of the encoding tells the port logic what to do.
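
As a rough illustration of the r[2:0] decode, the sketch below maps each encoding to a destination using the behavior described in the subsections that follow. The names, the "broadcast" marker, and the dictionary-based root table are hypothetical; the broadcast case in particular is simplified, since Unlock is not actually broadcast by this switch.

```python
# Hypothetical sketch of the implicit-routing decode for message TLPs.

COP_PORT = 16  # ports 0 through 15 are the data ports

R_TO_ROOT = 0b000
R_BY_ADDRESS = 0b001
R_BY_ID = 0b010
R_BROADCAST = 0b011

def implicit_destination(r, osd, root_table):
    """Return a port number, the marker "broadcast", or None if not implicit.

    root_table is an assumed mapping from OSD number to root port number.
    """
    r &= 0b111
    if r in (R_BY_ADDRESS, R_BY_ID):
        return None                 # handled by address or ID routing instead
    if r == R_TO_ROOT:
        return root_table[osd]      # index the root table by OSD number
    if r == R_BROADCAST:
        return "broadcast"          # simplified: handled by lookup logic and COP
    return COP_PORT                 # 100, 101, and reserved 110-111 go to the COP

ROOT_TABLE = {0: 3, 1: 7}  # made-up example values
```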




2.2.6.1.1 Routed to Root Complex

Our chip will have a root complex field assigned to each input port based on the source
OSD to handle this type of packet. The Address_lookup_module will simply index the
root table (by OSD number) and return the root port number to the MAC.

2.2.6.1.2 Broadcast from Root Complex

This only applies to two types of TLPs, PME_Turn_Off and Unlock. Unlock will not
actually be broadcast (which is legal according to the spec) since our switch will track
which port has been locked and unlock the appropriate queues.

PME_Turn_Off will be handled by the Address_lookup_module logic and the COP. The
lookup logic [...]

2.2.6.1.3 Local - Terminate at Receiver

This type of packet is returned from the Address_lookup_module as routed
to port 16 for the COP to handle (ports 0 through 15 are the actual data ports).
2.2.6.1.4 Gathered and Routed to Root Complex

This type of routing is only used whenever the switch receives PME_TO_Ack
messages from downstream ports. The Address_lookup_module will again return port
16. The COP will scoreboard the responses from all downstream ports. Once the COP
has received a PME_TO_Ack packet from each downstream port, it then returns a single
PME_TO_Ack packet back to the root complex and sets the r[2:0] sub-field to 3'b101 to
tell the root complex that all downstream ports in the switch responded to the
PME_Turn_Off message.

Note that a timer must also be implemented in this logic to avoid deadlock, since the
return of the PME_TO_Ack packet back to the Root should not be blocked due to one
device's failure to send a PME_TO_Ack in a reasonable amount of time (no time given
in the spec for this, but 100 ms is mentioned elsewhere as a timeout number, so maybe
we'll use that).
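
A minimal sketch of the scoreboard and its deadlock timer is shown below. The class and method names are hypothetical, the 100 ms default is only the tentative value mentioned above, and time is advanced explicitly so the behavior is easy to follow.

```python
# Hypothetical sketch of the COP's PME_TO_Ack scoreboard with a timeout.

class PmeAckScoreboard:
    def __init__(self, downstream_ports, timeout_ms=100):
        self.pending = set(downstream_ports)  # ports that have not acked yet
        self.timeout_ms = timeout_ms          # assumed value, not fixed by the spec
        self.elapsed_ms = 0
        self.done = False

    def ack(self, port):
        """Record a PME_TO_Ack from one downstream port."""
        self.pending.discard(port)
        return self._check()

    def tick(self, ms):
        """Advance the timer by ms milliseconds."""
        self.elapsed_ms += ms
        return self._check()

    def _check(self):
        """Return True exactly once, when the single aggregated PME_TO_Ack
        (r[2:0] = 3'b101) should be sent to the root complex."""
        if self.done:
            return False
        if not self.pending or self.elapsed_ms >= self.timeout_ms:
            self.done = True
            return True
        return False
```

The aggregated acknowledgement is released either when the last port responds or when the timer expires, so one silent device cannot block the root complex.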
2.2.6.1.5 Reserved - Terminate at Receiver

These are nearly identical to the Local - Terminate at Receiver types of routing. This 
transaction will return port 16 from the Address_lookup_module and will be sent to the 
COP. 



2.2.7 Isolation of OS Domains 

As TLPs are received on a Rx MAC, they are queued into a structure that is separated by
OS domain. PCI-EX ordering rules are always adhered to within an OS domain but
never across OS domains to avoid head-of-line blocking conditions. Flow control is also
isolated by OS domain and VC. The PCI-EX base specification defines flow control on a
per VC basis. The switch is configured to allow assignment of VC resources per OS
domain to enable absolute flow control (i.e. flow control per VC per OS domain).
Therefore, from the Rx MAC perspective, each OS domain is allocated its own buffer,
queue, and flow control resources.

In the Switch Core, all transactions are arbitrated using a simple round-robin algorithm,
where fairness is enforced at 3 different levels: port arbitration, VC arbitration, and OSD
arbitration.
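
For illustration, a single level of round-robin arbitration might look like the sketch below. The class name and interface are hypothetical, and only one level is modeled; the three levels named above would presumably be built by nesting or chaining arbiters of this kind, which is an assumption rather than something the text specifies.

```python
# Hypothetical sketch of one round-robin arbitration level.

class RoundRobinArbiter:
    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last = num_requesters - 1   # so requester 0 has priority first

    def grant(self, requests):
        """requests is a list of booleans, one per requester.

        Returns the index granted this cycle, or None if nobody is requesting.
        The search starts just after the previous winner, which gives each
        active requester a turn before any requester is served twice.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None
```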



The COP manages the different OS domains by appearing as a Type 0 "shared" I/O
device. After switch configuration is complete, each Root Port is assigned to a
particular OS domain, along with, for each Root Port bus, what ports are targets on that bus and
their corresponding port/OSD. Any Root Port has no knowledge of the presence of
any other Root Port. If a Root Port uses a shared I/O port, it has no knowledge of
the other OS domains that can access the shared I/O port. Therefore, at the COP level,
OS domains are completely isolated.

One exception to this is a scenario where the console management software is
running on a specific server. In that case, the OSD is allowed to manage the device-
specific registers by accessing them through the COP's type 0 header space. To become
a trusted blade server, the driver on that OSD must present a key to the switch. If that
key matches a trusted key, the COP sets a bit that tells that OSD it can now access our
device-specific registers.



2.3 Data Flow Examples
2.3.1 Address-based Requests 

All Memory requests (Reads and Writes) and I/O requests (Reads and Writes) are
address-based; their Completions are routed by ID (see section 2.2.5). The following
describes the data flow of an address-based TLP:

1. The initiator generates a Posted Request.

2. The Rx MAC receives the Posted Request from the SERDES.

3. The Rx Physical Layer Module (RxPHY) performs 8b/10b decoding,
descrambles the data stream, and performs clock compensation between the
extracted clock and the core clock. The data stream is in the form of 8 bits per
clock (at 250 MHz). It will also perform lane-to-lane deskew if lane width is
greater than one.

4. The Rx Data Link Layer Module (RxDLLM) converts the data from the RxPHY
to a 64-bit data path at 250MHz regardless of the link width. For example, if the
link width is x1, it will take 8 clks to gather 64 bits at 250 MHz. The Rx DLLM
then checks the Sequence Number and LCRC as the TLP is passing through to the
Transaction Layer. The Rx DLLM also removes the Sequence Number and the
LCRC from the TLP before it forwards the packet to the Rx Transaction Layer
Module (TLM).

5. The RxTLM performs TC to VC mapping and OS domain identification to
determine if there are enough header and data resources for the particular [...]. If
there are, the TLP is stored in the header buffer and the data buffer (if the TLP
contains payload) and the starting address of the Header for the TLP is stored in the
Address Lookup FIFO. All flow control calculations are implemented in the Rx
TLM and are scheduled to transmit every time a [...] or FC credit is
de-allocated.

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the 32-
bit or 64-bit address) from the Header Buffer. The following information is
passed into the Lookup Module:

• address[43:0] - contains either the upper 44 bits of a 64-bit memory
transaction, the upper 12 bits of a 32-bit memory transaction, or the upper
20 bits of an I/O transaction.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The types relative to this transaction type are
marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000 *           32-bit memory transaction
3'b001 *           64-bit memory transaction
3'b010             ID-based transaction
3'b011 *           32-bit I/O transaction
3'b100             Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].
Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. The Address Lookup Module (ALM) figures out the root port and which ports are
connected to this ingress port. Then it begins walking the entries (up to 16) to
find the base/limit pair that matches the address range, based on the array of valid
ports to search. Note that the root entry is searched first if the
'port_is_downstream' bit is set. The ALM also determines the egress QID
number (ranging from 0 to 15).

8. The ALM returns the egress_qid[3:0] and destination_port[4:0] to the Address
Lookup Interface in the RxTLM. The transaction will not be submitted to the
Transaction Queues until there is enough data, or until the entire packet is stored
when the egress_cut_through_enbl is not set.

9. When the Address Lookup Module returns the egress_qid[3:0] and
destination_port[4:0], the Address Lookup Interface stores the information in a 32
deep FIFO. This ALM Response FIFO is necessary in case the Transaction
Queues are not able to accept the transactions as fast as the Address Lookup
Module Interface is able to submit them. When this FIFO becomes half full, it
applies backpressure: the Address Lookup Module Interface will not issue any
more requests to the Address Lookup Module until the ALM Response FIFO is
less than half full.

10. The state of each transaction queue (empty or non-empty) in the Rx PM is sent to
the Switch Core. The Switch Core will query the Presentation Module asking for
the next transaction in a particular queue. The Presentation Module will resolve
the transaction ordering and return a packet header to the Switch Core. The
Switch Core will check flow control credits and will queue the transaction unless
it does not pass flow control. The Switch Core will use the query
information to eventually request the Presentation Module to transmit a packet.

11. The Presentation Layer will read the packet information and pop the packet out of
the Transaction Queue. It will pass the packet information to the Packet
Generator.

12. The Packet Generator will take the packet information, and create a TLP by
reading the original TLP header and appending the TLP data. This packet is sent to
the [...]

13. Once the Switch Data Mover starts transferring the data, the TLP is stored in the
Mini-FIFO 64 bits at a time. As the TLP is read from the Mini-FIFO,
the Shared IO Header is inserted before the TLP Base Header (if the endpoint is a
shared I/O). This is fed into the Tx Data Link Layer Module (DLLM).

14. The Tx DLLM receives the TLP from the Mini-FIFO and starts calculating the
LCRC along with prepending the Sequence Number to the start of the TLP.

15. The TLP is forwarded to the Tx Physical Layer Module (PLM) where it is
scrambled and encoded from 8 bits to 10 bits and sent out on the wire.
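
The half-full backpressure rule for the ALM Response FIFO in step 9 can be sketched as follows. The class and method names are hypothetical; only the depth of 32 and the half-full threshold come from the text.

```python
# Hypothetical sketch of the 32-deep ALM Response FIFO and its backpressure.
from collections import deque

class AlmResponseFifo:
    DEPTH = 32

    def __init__(self):
        self.entries = deque()

    def backpressure(self):
        """True while the FIFO is at least half full; the Address Lookup
        Interface would hold off new ALM requests during this time."""
        return len(self.entries) >= self.DEPTH // 2

    def push(self, egress_qid, destination_port):
        """Store one ALM response (egress_qid[3:0], destination_port[4:0])."""
        if len(self.entries) >= self.DEPTH:
            raise OverflowError("ALM Response FIFO overflow")
        self.entries.append((egress_qid, destination_port))

    def pop(self):
        """Hand the oldest response to the Transaction Queues."""
        return self.entries.popleft()
```

One plausible reading of the half-full threshold is that it leaves room for responses to requests already in flight when backpressure is asserted; the text does not state the reason.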

2.3.2 Configuration Requests and Completions 

Configuration Requests and Completions follow the same steps as in the previous section
except that they are ID based transactions. Steps 6 and 7 would be replaced with:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the bus
number) from the Header Buffer. The following information is passed into the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The type relative to this transaction type is
marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000             32-bit memory transaction
3'b001             64-bit memory transaction
3'b010 *           ID-based transaction
3'b011             32-bit I/O transaction
3'b100             Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. The Address Lookup Module (ALM) figures out the root port and which ports are
connected to this ingress port. Then it begins walking the entries (up to 16) in the
bus_number_lookup_table to find the base/limit pair that matches the address
range, based on the array of valid ports to search. Note that the root entry is
searched first if the 'port_is_downstream' bit is set. The ALM also determines
the egress QID number (ranging from 0 to 15). If the ALM determines that a
particular configuration request is targeted for the COP, it will notify the Address
Lookup Interface that the TLP configuration type should be changed from type 1
to type 0 when the TLP is forwarded to the Data Mover.
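
The lookup request fields used in step 6 (here and in section 2.3.1) can be sketched as below. The constant names and the helper are hypothetical. Two points are assumptions rather than statements from the text: the value is right-justified within address[43:0], and a 32-bit memory transaction contributes its upper 12 bits, which follows from the 1 MB decode granularity.

```python
# Hypothetical sketch of how address[43:0] might be formed per lookup type.

LOOKUP_MEM32 = 0b000
LOOKUP_MEM64 = 0b001
LOOKUP_ID = 0b010
LOOKUP_IO32 = 0b011

def lookup_address(lookup_type, value):
    """Return the assumed contents of address[43:0] for one transaction.

    value is the full memory or I/O address, or the bus number for an
    ID-based transaction.
    """
    if lookup_type == LOOKUP_MEM64:
        return (value >> 20) & ((1 << 44) - 1)   # upper 44 bits of 64
    if lookup_type == LOOKUP_MEM32:
        return (value >> 20) & 0xFFF             # upper 12 bits of 32 (assumed)
    if lookup_type == LOOKUP_IO32:
        return (value >> 12) & 0xFFFFF           # upper 20 bits of 32
    if lookup_type == LOOKUP_ID:
        return value & 0xFF                      # 8-bit bus number
    raise ValueError("implicit routing types carry no address to look up")
```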

2.3.3 Message Requests 

Each type of Message Request behaves differently regarding data flow. The following
sections describe the various types of Message Requests and how they differ from the data
flow example in Section 2.3.1.
2.3.3.1.1 INTx Interrupt Signaling Message Requests

The INTx virtual wire interrupt signaling mechanism is used to support legacy Endpoints
in cases where the Message Signaled Interrupt mechanism cannot be used. All INTx
messages are routed to the Root Complex.

INTx Messages follow the same steps as in the previous section except that steps 6 and 7
would be replaced with:

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO.
When the Address Lookup Interface is ready to present a transaction to the
Address Lookup Module in the Switch Core, it uses the header address stored in
the Address Lookup FIFO to access the routing information (in this case, the bus
number) from the Header Buffer. The following information is passed into the
Lookup Module:

• address[43:0] - contains the 8-bit bus number.

• lookup_type[2:0] - field is used to specify the transaction type as shown
in the following table. The type relative to this transaction type is
marked with an asterisk.

Lookup type[2:0]   Transaction definition
3'b000             32-bit memory transaction
3'b001             64-bit memory transaction
3'b010             ID-based transaction
3'b011             32-bit I/O transaction
3'b100 *           Routed to root complex
3'b101             Broadcast from root complex
3'b110             Terminate at receiver
3'b111             Reserved

• port_is_downstream - lets the Address Lookup Module know what the
most likely lookup sequence should be (routed to root complex).

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction.

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine
the egress_qid[3:0].

Fast Path

If the Address Lookup FIFO is empty, the routing information is immediately
presented to the Address Lookup Module in the Switch Core.

7. In this case, the Address Lookup Module (ALM) knows that the TLP is going to a
Root Complex, so it only has to figure out the root port. Then it uses the device
number to determine the mapping of the INTx virtual wire on the primary side of the
bridge. The ALM returns int_map[1:0] to the Address Lookup Interface which
stores it in the Header Info Code in the Header Buffer so that the Packet
Generator will know to overwrite the Code in the TLP with the Code provided in
the Header Info Code.
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2.3.3. L2 Power Management Message Requests 

There are two Power Management Messages that require special handling in the Address 
Lookup Interface in the Rx TLM. They are PME_Tum_Off and PME_TO_Ack. 



2.3.3.1.3 PME_Turn_Off 

PME_Turn_Off is generated by a root complex to notify all of its downstream ports to 
prepare for power removal. The PME_Turn_Off Message has a routing type of 3'b101, which 
is 'broadcast from root complex'. In this case, steps 6 and 7 are replaced by the 
following: 

6. The Rx TLM stores the header address of each TLP in the Address Lookup FIFO. 
When the Address Lookup Interface is ready to present a transaction to the 
Address Lookup Module in the Switch Core, it uses the header address stored in 
the Address Lookup FIFO to access the routing information (in this case, the bus 
number) from the Header Buffer. The following information is passed into the 
Lookup Module: 

• address[43:0] - contains the 8-bit bus number. 

• lookup_type[2:0] - field is used to specify the transaction type as shown 
in the following table. The types relative to this transaction type are 



Lookup type[2:0]    Transaction definition 
3'b000              32-bit memory transaction 
3'b001              64-bit memory transaction 
3'b010              ID-based transaction 
3'b011              32-bit I/O transaction 
3'b100              Routed to root complex 
3'b101              Broadcast from root complex 
3'b110              Terminate at receiver 
3'b111              Reserved 




• port_is_downstream - lets the Address Lookup Module know what the 
lookup sequence should be (routed to root complex). 

• tc[2:0] - used to help determine the egress_qid[3:0] for this transaction. 

• osd[3:0] - this is used in conjunction with the tc[2:0] field to determine 
the egress_qid[3:0]. 



Fast Path 

If the Address Lookup FIFO is empty, the routing information is immediately 
presented to the Address Lookup Module in the Switch Core. 



In this case, the Address Lookup Module (ALM) knows that the TLP needs to be 
broadcast to all downstream ports configured to the Root Complex, so it looks up 
the endpoints that the TLP should be sent to by asserting the corresponding bits in 
broadcast_ports[15:0]. The Address Lookup Interface then submits a TLP to the 
Transaction Scheduler for each downstream port designated in 
broadcast_ports[15:0], along with sending it to the COP (port 16). The Address 
Lookup Interface will then halt all forward progress until it has been notified by 
the COP that the scoreboard is set up and it is ready to receive all the 
PME_Turn_Off Messages from each downstream port. 



2.3.3.1.4 PME_TO_Ack 

The only thing to note for PME_TO_Ack Messages is that when an upstream port sends a 
PME_TO_Ack message, all such messages should be routed to the COP. The COP will keep a 
scoreboard of all the endpoint ports that received a PME_Turn_Off Message and mark them 
off as they transmit a PME_TO_Ack message. When all of the endpoint ports have transmitted a 
PME_TO_Ack message (or the timer expires), the COP will generate one PME_TO_Ack 
Message to the Root Complex. 
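
The scoreboard behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the COP implementation: the class and method names are hypothetical, ports are identified by plain integers, and the timer is modeled as an external event.

```python
class PmeTurnOffScoreboard:
    """Hypothetical model of the COP scoreboard for one PME_Turn_Off broadcast."""

    def __init__(self, broadcast_ports):
        # Ports that received PME_Turn_Off and still owe a PME_TO_Ack.
        self.pending = set(broadcast_ports)
        self.ack_sent = False

    def on_pme_to_ack(self, port):
        """Record an ack; return True when the single upstream ack is due."""
        self.pending.discard(port)
        return self._maybe_finish()

    def on_timer_expired(self):
        """Timer expiry stops the wait for any ports that never answered."""
        self.pending.clear()
        return self._maybe_finish()

    def _maybe_finish(self):
        # Generate exactly one PME_TO_Ack toward the Root Complex.
        if self.ack_sent or self.pending:
            return False
        self.ack_sent = True
        return True
```

A return value of True marks the point at which the COP would generate the one PME_TO_Ack Message to the Root Complex.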





2.3.3.1.5 Locked Transaction Message Requests 

Whenever a particular root requests a locked transaction, all other sources going to that 
output will be halted. When the CplLk is received from the downstream port, all other 
upstream queues going to the root are locked until the Unlock message is received. 



2.4 Shared Link Description 



2.4.1 AS Encapsulation 

AS header encapsulation only pertains to Shared Ports. Non-shared Ports have no 
knowledge of "Shared I/O". On the Rx MAC, the Transaction Layer Module strips the 
AS Header from all TLPs before storing the packet in the in-line buffer. It uses the AS 
Header to determine the OS Domain associated with the given TLP. On the Tx MAC, 
the Transaction Layer Module inserts the AS Header to the given TLP. The OS Domain 
that is inserted as part of the AS Header is reported by the Switch Core and passed on to 
the Tx Transaction Layer Module. 

The AS Header is described in detail in the following section. Figure 2.4-2 describes the 
format of the AS Header. 



2.4.2 OS Domain Routing 

For our switch, ports can be shareable ports, which means multiple different CPUs can 
address resources over the same PCI-Express link. A maximum of 16 OS Domains (or 
CPUs) will be supported in this implementation, with each port having the capability to 
send and receive from 16 OS Domains, across all 8 VCs possible in the PCI Express 
spec. 



The PCI-Express Advanced Switching (AS) spec incorporates an 8-byte AS header that is 
inserted into the transaction. Our switch will use this header to specify the OS Domain 
associated with a given transaction. The following diagram shows where this header is 
located in the packet. 



[Packet diagram; legible layer labels: Transaction Layer, Data Layer, Physical Layer] 

Figure 2.4-1: 

AS headers specify an 8-bit field, called the PI (Protocol Interface) field, that specifies 
what type of AS packet is contained in the payload. We're going to use a number 
ranging from 224 through 254 as a vendor-defined PI. The only other piece of 
information in our AS header will be the 8-bit OS Domain number. The proposed AS header 
is shown below. 



[AS header bit layout, byte offsets +0 through +3; legible fields: R, RNP, RN, OSD, PI] 

Figure 2.4-2: 



PI - Protocol Identifier field in AS 
OSD - OS Domain number 
RNP - Resource Number Present (when high, the RN field is valid; when low, it is 
invalid and must be set to all 0's) 
RN - Resource Number (which buffer this packet belongs to) 
R - reserved 

On shared ports, the RX MAC will first check the PI field to make sure this type of AS 
packet is understood by our switch. We'll have a register within the RX MAC defined 
that contains the allowable encoding for this type of AS packet. At first, this will likely 
be our selected vendor-defined PI number. If our technology is adopted by the SIG, a 
standard number will be defined, and this register will then be programmed to contain 
this standard value for use with future devices. 



This AS header allows our switch to map the incoming value of the OS Domain field 
(local to the I/O device or inter-switch port) in the AS header to the global OS Domain 
number within the switch (one of 16 values, 0 through 15 for the first revision of our 
switch). Any packets received on the shared port that use a different PI type will be 
discarded. 
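
The PI check and OS Domain mapping on a shared port can be sketched as below. The function name, the dictionary used as the mapping table, and the decision to also discard packets whose local OSD has no mapping are assumptions made for illustration; the text only states that packets with a different PI type are discarded.

```python
from typing import Dict, Optional

# Vendor-defined PI values named in the text: 224 through 254 inclusive.
VENDOR_PI_RANGE = range(224, 255)


def map_incoming_osd(pi: int, local_osd: int, allowed_pi: int,
                     osd_table: Dict[int, int]) -> Optional[int]:
    """Return the global OS Domain number (0..15), or None to discard.

    allowed_pi models the RX MAC register holding the accepted PI value;
    osd_table maps the link-local OSD field to the switch-global number.
    """
    if pi != allowed_pi:
        return None  # unknown AS packet type: discard
    global_osd = osd_table.get(local_osd)
    if global_osd is None or not 0 <= global_osd <= 15:
        return None  # assumption: unmapped domains are also discarded
    return global_osd
```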
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For upstream ports, the RX MAC will only need to associate the OSD number with a 4- 
bit register number loaded at config time. This number will be the OSD field that is 
passed to the lookup unit to match the correct set of P2P bridge headers to compare for a 
lookup match. 

2.4.3 Flow Control and Credits 



Flow Control is used to track the queue/buffer space available in the Agent across the 
link. It is used to prevent overflow of receiver buffers. The Flow Control information is 
conveyed between two sides of the Link using DLLP packets. 

For Non-shared Endpoints (only one OSD), flow control updates follow the same format 
as the PCI Express Base Specification, which is as follows: 



[DLLP bit layout, byte offsets +0 through +3 followed by the CRC; legible fields in 
order: P/NP/Cpl, VC ID, HdrFC, DataFC, 16b CRC] 

Figure 2.4-3: DLLP format for Flow Control Packet for Non-Shared Ports 

P/NP/Cpl: This field specifies the type of transaction that is being reported. P - Posted 
Request; NP - Non-posted Request; Cpl - Completion. 

VC ID: This field specifies the Virtual Channel that is being reported. 

R: Reserved 

HdrFC: This field contains the credit value for Headers of the indicated type. One credit 
value for headers is one maximum-size header plus TLP digest. 

DataFC: This field contains the credit value for payload data of the indicated type. One 
credit value is equivalent to 16 bytes of data. 

16bCRC: This field contains the calculated CRC value of all bits of the packet using the 
polynomial coefficient of 100Bh. 

For Shared Endpoints, flow control updates are advertised using the following DLLP to 
account for the OSD: 




[DLLP bit layout, byte offsets +0 through +3 followed by the CRC; legible fields in 
order: Type (1011, 0 V2 V1 V0), TT, R, OSD, CT, Credit count, 16bCRC] 

Figure 2.4-4: DLLP format for Flow Control Packet for Shared Ports 



• Type: Upper nibble set to 1011 for an FC Update shared-link DLLP. The lower 
nibble specifies the VC number. 

• TT: Transaction Type (00 for Posted, 01 for Non-posted, 10 for Completions) 

• R: Reserved 

• OSD: OSD number 

• CT: Credit Type (0 for header credits, 1 for data credits) 

• Credit count: Contains either the 12-bit data credit count or the 8-bit (upper 4 bits 
are zeros) header credit count, based on the value of the CT bit 

• 16bCRC: This field contains the calculated CRC value of all bits of the packet 
using the polynomial coefficient of 100Bh. 
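
Two of the fields above are specified tightly enough to sketch in code: the Type byte and the Credit count field. The helper names are hypothetical, and the sketch deliberately does not place TT, OSD, or CT into bytes, because their bit positions are not recoverable from the figure.

```python
FC_UPDATE_SHARED_TYPE = 0b1011  # upper nibble of the Type byte

TT_POSTED, TT_NON_POSTED, TT_COMPLETION = 0b00, 0b01, 0b10
CT_HEADER, CT_DATA = 0, 1


def fc_update_type_byte(vc: int) -> int:
    """Build the Type byte: 1011 in the upper nibble, VC number below it."""
    if not 0 <= vc <= 7:
        raise ValueError("VC number must be 0..7")
    return (FC_UPDATE_SHARED_TYPE << 4) | vc


def credit_count_field(ct: int, credits: int) -> int:
    """Return the 12-bit Credit count field for the given credit type.

    Header credits are 8 bits wide, so the upper 4 bits stay zero;
    data credits use all 12 bits.
    """
    limit = 0xFF if ct == CT_HEADER else 0xFFF
    if not 0 <= credits <= limit:
        raise ValueError("credit count does not fit the field")
    return credits & 0xFFF
```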

2.4.3.1.1 Receiver Flow Control 

The RxTLM keeps track of flow control accounting functions of its buffers. This 
information is assembled into a FCUpdate DLLP and forwarded to the Tx Transaction 
Layer Module where it is scheduled to be transmitted to the Agent across the link. 

For each type of information tracked, the following quantities are calculated for flow 
control TLP Receiver accounting (for non-shared ports, these calculations are performed 
for each VC, and for shared ports these calculations are performed for each VC/OSD 
group): 




CREDITS_ALLOCATED - The total number of credits granted to the 
Transmitter since initialization, modulo 2^[Field Size] (where [Field Size] is 8 for 
headers and 12 for payload data). 

CREDITS_RECEIVED - The total number of FC units consumed by valid TLPs 
received since flow control initialization, modulo 2^[Field Size] (where [Field Size] is 
8 for headers and 12 for payload data). 




The RxTLM will also check for buffer overflow errors. This is done by checking the following 
equation: 

(CREDITS_ALLOCATED - CREDITS_RECEIVED) modulo 2^[Field Size] >= 2^[Field Size] / 2 
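
A minimal sketch of the receiver overflow check follows. The function name is hypothetical, and the comparison is written as ">=" to follow the overflow test in the PCI Express Base Specification; the counters are the modulo 2^[Field Size] values defined above.

```python
def receiver_overflow(credits_allocated: int, credits_received: int,
                      field_size: int) -> bool:
    """Return True when the receiver buffer has overflowed.

    field_size is 8 for headers and 12 for payload data. Python's %
    operator yields a non-negative result for a positive modulus, so the
    subtraction wraps the same way the hardware counters do.
    """
    modulus = 1 << field_size
    return (credits_allocated - credits_received) % modulus >= modulus // 2
```

For example, with 8-bit header counters, 10 credits allocated and 11 received gives a wrapped difference of 255, which is flagged as an overflow.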

The scheduling of UpdateFC DLLPs will obey the following rules: 

If the Link is in the L0 or L0s Link state, UpdateFC DLLPs will be scheduled for 
transmission once every 30us or 120us, depending on the status of the Extended 
Sync bit in the Control Link Register. 

A Timer will also be implemented with the following rules: 

■ The Timer is active only when the Link is in the L0 or L0s Link 
state. 

■ The Timer has a limit of 200us. 

■ The receipt of any Init or Update FC DLLP resets the Timer. 

■ Upon Timer expiration, the Physical Layer will be instructed to 
retrain the Link. 

Otherwise, for all types of transactions that do not have infinite credit, a Flow Control 
DLLP will be scheduled for transmission after a valid TLP is received and stored, or 
when one unit is made available by TLPs processed. 
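
The Timer rules listed above can be modeled as a small state holder. This is a sketch only: the class and method names are hypothetical, time is advanced in whole microseconds by the caller, and the choice to hold (rather than clear) the count while the Link is outside L0 and L0s is an assumption the text does not settle.

```python
class FcUpdateTimer:
    """Hypothetical model of the 200us flow control update Timer."""

    LIMIT_US = 200
    ACTIVE_STATES = ("L0", "L0s")

    def __init__(self):
        self.elapsed_us = 0

    def on_fc_dllp(self):
        """Receipt of any Init or Update FC DLLP resets the Timer."""
        self.elapsed_us = 0

    def advance(self, delta_us: int, link_state: str) -> bool:
        """Advance time; return True when the Link should be retrained."""
        if link_state not in self.ACTIVE_STATES:
            return False  # Timer only runs in L0 or L0s
        self.elapsed_us += delta_us
        return self.elapsed_us >= self.LIMIT_US
```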
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2.4.3.1.2 Transmitter Flow Control 

The Transaction Scheduler receives the latest FC updates from all the Tx MACs. These 
FC updates report the most recent number of FC units advertised by the receiver on the 
other side of the link, called CREDIT_LIMIT. CREDIT_LIMIT is used to determine whether 
a transaction being transferred to a particular Tx MAC has enough FC credits to be 
forwarded to that Tx MAC. 

For each type of information tracked, the following quantities are calculated for flow 
control gating: 



CREDITS_CONSUMED - The total number of FC units consumed by TLP 
transmissions made since flow control initialization, modulo 2^[Field Size] (where 
[Field Size] is 8 for headers and 12 for payload data). 

CREDIT_LIMIT - The most recent number of FC units legally advertised by the 
Receiver. This quantity represents the total number of FC credits made available 
by the Receiver since flow control initialization, modulo 2^[Field Size] (where [Field 
Size] is 8 for headers and 12 for payload data). 



To determine if there is enough credit for the current transaction, the following equation 
is evaluated: 

(CREDIT_LIMIT - (CREDITS_CONSUMED + current TLP credits)) modulo 2^[Field Size] <= 2^[Field Size] / 2 
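
The gating test can be sketched as a single function. The name may_transmit and the tlp_credits parameter (the credits the current TLP would consume) are hypothetical; the comparison follows the transmit gating rule of the PCI Express Base Specification.

```python
def may_transmit(credit_limit: int, credits_consumed: int,
                 tlp_credits: int, field_size: int) -> bool:
    """Return True when the current TLP fits within the advertised credits.

    field_size is 8 for headers and 12 for payload data; all counters are
    kept modulo 2**field_size, so the test still works after wraparound.
    """
    modulus = 1 << field_size
    required = (credits_consumed + tlp_credits) % modulus
    return (credit_limit - required) % modulus <= modulus // 2
```

With 12-bit data counters, a limit of 100 and 90 credits consumed allows a 10-credit TLP but blocks an 11-credit one.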



Even though the Transaction Scheduler determines Flow Control gating of transactions, it 
does not have any knowledge of the error status of the transaction. It is not until the 
Transaction Scheduler transfers the TLP to the Tx MAC that it is known if the packet is 
in error. Therefore, the Tx MAC is responsible for notifying the Transaction Scheduler 
that the current TLP is in error so that it does not affect CREDITS_CONSUMED. 



2.4.4 

There are two types of Reset on the chip - Fundamental Reset and Hot Reset. The 
following diagram will be used throughout the document to describe the devices affected 
when a type of Reset is asserted at a Root Complex, a Root Port, the Switch, a 
Downstream Port, or an Endpoint (I/O device) attached to a Downstream Port. Ports 1, 2, 
3, 9 and 10 are all attached to root complexes and therefore represent one OS domain (for 
simplicity, the OS domain number will directly correlate to the port number, i.e. port 1 is 
assigned OS domain 1 in this example). Switch #1 has been configured such that port 4 
is shared by OS domains 1 and 3, port 5 is only accessed by OS Domain 2, port 6 is 
shared by OS domains 1, 2, and 3, and port 7 is shared by OS domains 1 and 2. 
Downstream port 4 in Switch #1 is connected to Root Port 8 in Switch #2, which enables 
access to the endpoints on Switch #2 from the Root Complexes in Switch #1. 



NextIO, Inc. Confidential 
Property of NextIO, Inc. 



Page 22 of 222 



NextIO, Inc. 

©All Rights Reserved. 



NEXSIS Overview Document 
V0.8 



[Topology diagram of Switch #1; legible labels: Root Complexes attached to Root Ports 
P1, P2, and P3, Downstream Ports P4 through P7 with their OS domain numbers, Endpoint] 

Figure 2.4-5: Example Topology 



2.4.4.1.1 Fundamental Reset 

Fundamental Reset is an auxiliary signal provided by the system to a component or add-in 
card. The signal must be called PERST#. Fundamental Reset can be asserted on the 
Root Complex, the Endpoints (I/O Devices), or the Switch. 

When Fundamental Reset is asserted on the Endpoints or the Switch, the behavior is 
identical to what is described in the PCI Express Base Specification - all of the Links 
attached to the device being reset will be retrained, the state machines will be initialized, 
and all TLP information will be flushed. 



If Fundamental Reset is asserted on a Root Complex, not only does the Root Complex get 
reset, but all of its downstream ports must reset as well. If its downstream ports are 
"shared" with other Root Complexes, it is important to be able to reset only the part of 
the downstream port that pertains to the Root Complex being reset, and to leave the rest 
of the downstream port logic unchanged. This is done by generating and transmitting 
vendor-specific DLLPs called Reset DLLPs that inform the two components on the link 
which OS Domain is getting reset (this is explained further in the following sections). If 
the downstream port is not shared by Root Complexes, the Link will reset according to 
the PCI Express Base Specification. The following sections describe how the chip 
behaves when a Fundamental Reset is asserted on the various parts of the chip. 



2.4.4.1.2 Fundamental Reset initiated at the Switch 

If the Fundamental Reset on the Switch is asserted, it will propagate the Reset to all 
upstream and downstream ports. The devices attached to all the ports will be reset 
according to the PCI Express Base Specification - all of the Links attached to the device 
being reset will be retrained, the state machines will be initialized, and all TLP 
information will be flushed. If the fabric topology involves more than one Switch, and a 
Root Port in another Switch is affected by the reset, then all of the downstream ports 
assigned to that Root Port are also reset. The components highlighted in yellow in Figure 
2.4-6 depict the components that are affected when Switch #1 is reset. Everything inside 
Switch #1 is reset, along with the devices attached to the ports. Note that in Switch #2, 
only one of the root ports is affected by the reset. That scenario is explained in detail in 
Section 2.4.4.1.3. 
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Figure 2.4-6: Fundamental Reset initiated at Switch #1 

2.4.4.1.3 Fundamental Reset initiated at the Root Complex 

If the Fundamental Reset on a Root Complex is asserted, the reset must be propagated to 
all its downstream ports without corrupting the traffic from the other OS Domains (or 
Root Complexes). The following steps are taken when a Fundamental Reset is asserted 
at a Root Complex: 




At the Root Complex 

1. All Port Registers and State Machines must be set to their initial values as 
specified in the PCI Express Base Document. 

2. The Root Complex will attempt to retrain the link. 

3. Once both components on the link have entered the initial link training state, they 
will proceed through Link Initialization and then through Flow Control 
Initialization for VC0. 

At the Switch 

1. The Root Port connected to the Root Complex will retrain and initialize all its 
state machines and registers. 

2. The Root Port will notify the COP that all of its downstream ports need to be 
reset. 

3. The COP will pass the reset notification to all the downstream Ports. 

4. All registers and state machines relevant to the OSD being reset must be set to 
their initial values. All TLPs belonging to the OSD being reset must be flushed - 
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the 
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs 
belonging to the OSD will be rejected. All other TLPs will be preserved. 

5. The Downstream Port will be notified of the reset condition and the OSD that has 
initiated the reset. 

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root 
Complex), the port is reset by attempting to retrain the Link. 

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has 
initiated the reset must be flushed. Flow Control must also be updated to reflect 
the flushing of TLPs. A vendor-specific DLLP is generated called a Reset DLLP 
by the Downstream Port. The Reset DLLP contains the OSD that initiated the 
reset and is transmitted on the link. A Reset DLLP is transmitted every time the 
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the 
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are 
operating normally. The Downstream Port will continue to transmit Reset DLLPs 
until the reset notification from the COP has been removed. 



At the Endpoint 

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the 
port is reset by attempting to retrain the Link. 

1b. If the Endpoint is shared, it will receive the Reset DLLPs and clear all registers, 
state machines, and flush all TLPs that pertain to the OSD initiating the reset. 
The device will stay in this reset mode until it stops receiving Reset DLLPs. All 
traffic related to the OSDs that are not being reset will operate normally. 
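
The choice between retraining the Link and sending Reset DLLPs (steps 6a and 6b above) can be sketched as follows. The function name and the string return values are hypothetical, and the port-to-OSD assignments in the usage note are taken from the example topology of Switch #1.

```python
def downstream_port_reset_action(port_osds, reset_osd):
    """Pick the reset action for one downstream port.

    port_osds is the set of OS domains that access the port; reset_osd is
    the OS domain whose Root Complex initiated the reset.
    """
    if reset_osd not in port_osds:
        return "none"              # port is not used by the resetting OSD
    if len(port_osds) == 1:
        return "retrain_link"      # step 6a: non-shared port
    return "send_reset_dllp"       # step 6b: shared port, per-OSD reset


# Example topology: OS domains assigned to downstream ports 4 to 7.
SWITCH1_PORT_OSDS = {4: {1, 3}, 5: {2}, 6: {1, 2, 3}, 7: {1, 2}}
```

With OS domain 1 being reset, ports 4, 6, and 7 send Reset DLLPs and port 5 is left untouched.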



The components highlighted in yellow in Figure 2.4-7 depict the components that are 
affected when the Root Complex attached to Root Port #1 is reset. Since downstream 
ports 4, 6, and 7 are shared ports, only the logic pertaining to OS domain 1 should be 
affected by the reset. On the other hand, port 14 in Switch #2 is only accessed by the OS 
Domain being reset and can therefore reset the entire port by retraining the Link (instead 
of sending Reset DLLPs specific to an OS domain). 
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Figure 2.4-7 : Fundamental Reset initiated at a Root Complex 



2.4.4.1.4 Fundamental Reset initiated at an Endpoint 

If the Fundamental Reset on an endpoint device is asserted, the device will simply reset 
with its link according to the PCI Express Base Specification - all of the Links attached to 
the device being reset will be retrained, the state machines will be initialized, and all TLP 
information will be flushed. All other Ports on the Switch will not be affected. 



The components highlighted in yellow in Figure 2.4-8 depict the components that are 
affected when the endpoint connected to downstream port 13 is reset. It simply attempts 
to retrain with the port on the other side of the Link. Any transaction being received by 
the Tx MAC from the Switch Core will be discarded and will never reach the Data Link 
Layer Module. The Root Complex will eventually time out when it never receives a 
completion for a particular request. (We could also let the COP generate UR completions 
since it will know which endpoints are in reset. The ALM could keep track of which 
transactions are going to egress ports that are in reset and then route the packets to the 
COP.) 
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Figure 2.4-8 : Fundamental Reset initiated at an Endpoint 
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2.4.4.1.5 Hot Reset 

Hot Reset is an in-band mechanism for propagating reset across a link. A Link can enter 
Hot Reset if directed by a higher layer, or if it receives two consecutive TS1 ordered sets 
with the Hot Reset bit asserted. The following sections describe how the chip behaves 
when a Hot Reset is asserted on the various parts of the chip. 

2.4.4.1.6 Hot Reset initiated at the Root Port 

A Root Port can enter Hot Reset by having its Secondary Bus Reset bit set in the Bridge 
Control Register, or by receiving two consecutive TS1s with the Hot Reset bit set on the 
Link. If the Hot Reset on a Root Port is initiated, the reset must be propagated to all 
downstream ports without corrupting the traffic from the other OS domains (or Root 
Complexes). The following steps are taken when a Hot Reset is initiated at a Root Port: 

At the Root Complex 

1. The Root Complex receives a TS1 sequence with the Hot Reset bit asserted and 
will attempt to retrain the Link. This can happen if the secondary bus reset bit is 
set in the P2P config space. 

2. All Port Registers and State Machines must be set to their initial values as 
specified in the PCI Express Base Document. 

3. Once both components on the link have entered the initial link training state, they 
will proceed through Link Initialization and then through Flow Control 
Initialization for VC0. 

At the Switch 

1. The Root Port connected to the Root Complex will retrain and initialize all its 
state machines and registers. 

2. The Upstream Port will notify the COP that all of its downstream ports need to be 
reset. 

3. The COP will pass the reset notification to all the downstream Ports. 

4. All registers and state machines relevant to the OSD being reset must be set to 
their initial values. All TLPs belonging to the OSD being reset must be flushed - 
all TLPs stored in the Rx in-line buffers will naturally drain and all TLPs in the 
Tx retry buffers will drain after Ack DLLPs are received. During reset, new TLPs 
belonging to the OSD will be rejected. All other TLPs will be preserved. 

5. The Downstream Port will be notified of the reset condition and the OSD that has 
initiated the reset. 

6a. If the Downstream Port is not shared (i.e. it is only accessed by one Root 
Complex), the port is reset by attempting to retrain the Link. 

6b. If the Downstream Port is shared, all TLPs pertaining to the OSD that has 
initiated the reset must be flushed. Flow Control must also be updated to reflect 
the flushing of TLPs. A vendor-specific DLLP is generated called a Reset DLLP 
by the Downstream Port. The Reset DLLP contains the OSD that initiated the 
reset and is transmitted on the link. A Reset DLLP is transmitted every time the 
Transaction Arbiter selects the OSD that initiated the reset. Otherwise, the 
Transaction Arbiter schedules TLPs to transmit on the other OSDs that are 
operating normally. The Downstream Port will continue to transmit Reset DLLPs 
until the reset notification from the COP has been removed. 

At the Endpoint 

1a. If the Endpoint is not shared (i.e. it is only accessed by one Root Complex), the 
port is reset by attempting to retrain the Link. 

1b. If the Endpoint is shared, it will receive the Reset DLLPs and clear all registers, 
state machines, and flush all TLPs that pertain to the OSD initiating the reset. 
The device will stay in this reset mode until it stops receiving Reset DLLPs. All 
traffic related to the OSDs that are not being reset will operate normally. 

2.4.4.1.7 Hot Reset initiated at the COP 

A Hot Reset can be initiated at the Switch through a set of Registers that can be accessed 
via an I2C interface. The Hot Reset can be programmed such that it can operate on a per 
port and/or per OSD basis. If the Hot Reset is propagated to Ports without distinguishing 
between OSDs, the Ports will be reset according to the PCI Express Base Specification - 
all of the Links attached to the device being reset will be retrained, the state machines 
will be initialized, and all TLP information will be flushed. If the Hot Reset is 
propagated to a subset of the OS domains on a particular Port, the Port will use Reset 
DLLPs to reset the designated part of its port logic pertaining to the OS domain specified 
in the Reset DLLP. 

2.4.4.1.8 Hot Reset initiated at a Downstream Port 

A Downstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in the 
Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that 
correspond to the Secondary Bus Reset bits that were asserted and clear all Registers and 
State Machines pertaining to the OSD. All traffic related to the OSDs that are not being 
reset will operate normally. 



2.4.4.1.9 Hot Reset initiated at a Shared Upstream Port 

A Shared Upstream Port can enter Hot Reset by having its Secondary Bus Reset bit set in 
its Bridge Control Register. The Port will transmit Reset DLLPs on the OSDs that 
correspond to the Secondary Bus Reset bits that were asserted and clear all Registers and 
State Machines pertaining to the OSD. All traffic related to the OSDs that are not being 
reset will operate normally. 

2.4.4.1.10 Hot Reset initiated at an I/O Device 

If a device wants to reset itself, it can do so by either transmitting Reset DLLPs or by 
transmitting TS1s. If the device wishes to only reset a particular OSD, it will generate 
and transmit Reset DLLPs that specify the OSD to reset. It will also clear all Registers 
and State Machines pertaining to the OSD. If the device wishes to reset the entire link, it 
will generate and transmit TS1s with the Hot Reset bit asserted. It 
will also initialize all Registers and State Machines and attempt to bring up the link. 

2.4.5 Power Management 

PCI Express Power Management provides the following services: 

• It allows software driven D-state transitions to change the Link power 
management states for a physical Link. 

• It provides a hardware-autonomous capability to change the Link power 
management states for a physical Link (Active State Power Management). 

• It provides a wakeup mechanism driven by in-band TLPs routed from the 
requesting device towards the Root Complex - these are called power 
management event (PME) Messages. 

• It provides a means to change the Power Management state by generating PMEs 
per PCI Express function. 



2.4.5.1.1 Link State Power Management 

A PCI Express physical Link can enter Link power management states by either software 
driven D-state transitions or by active state Link power management (ASPM). Defined 
Link states include L0, L0s, L1, L2, and L3. The power savings increase as the Link 
state transitions from L0 through L3. Table 2.4-1 and Table 2.4-2 summarize the Link 
Power Management States for both non-shared I/O and shared I/O. 





L-State       Description                       Used by S/W PM?   Used by ASPM?   Clocks & Power 
L0            Fully Active Link                 Yes (D0)          Yes (D0)        On 
L0s           Standby State                     [illegible]       Yes (D0)        On 
L1            Lower Power Standby               [illegible]       No              On 
L2/L3 Ready   Staging point for Power Removal   Yes               No              On 
L3            Off                               n/a               n/a             Off 

Table 2.4-1: Summary of Non-Shared I/O Link Power Management States 






L-State   Description          Used by S/W PM?   Used by ASPM?   Clocks & Power   Shared I/O 
L0        Fully Active Link    Yes (D0)          Yes (D0)        On               Normal Operation 
L0s       Standby State        No                Yes (D0)        On               TLP & DLLP transmission is prohibited 
                                                                                  for all OSDs (ASPM only) 
L1-L3     Lower Power States   No                No              On               TLP & DLLP transmission is prohibited 
                                                                                  for a specific OSD 

Table 2.4-2: Summary of Shared I/O Link Power Management States 
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2.4.5.1.2 Power Management Software Control 

One of the ways that power management states of a Link are determined is by the 
software driven D-state of its downstream component. Table 2.4-3 depicts the 
relationship between the power state of a component and its Upstream Link. A 
Downstream component can be an Endpoint or another Switch. 



Downstream          Permissible Upstream   Permissible Interconnect   Permissible Interconnect State 
Component D-State   Component D-State      State (for Non-Shared I/O) (for Shared I/O) 
D0                  D0                     L0, L0s                    L0, L0s 
D1 (optional)       D0-D1                  L1                         L0, L0s (Cannot go into a low level power state) 
D2 (optional)       D0-D2                  L1                         L0, L0s (Cannot go into a low level power state) 
D3hot               D0-D3hot               L1, L2/L3 Ready            L0, L0s (Cannot go into a low level power state) 
D3cold              D0-D3cold              [illegible]                L0, L0s (Cannot go into a low level power state) 

Table 2.4-3: Relation between Power Management States of Link and Components 

[Sequence diagram between the Upstream Component and the Downstream Component, each 
drawn as Transaction (T), Data Link (D), and Physical (P) layers. Legible steps: 

• Upstream component sends configuration write request 
• Downstream component begins L1 transition process 
• Downstream component blocks scheduling of new TLPs 
• Downstream component waits to receive Ack for last TLP 
• PM_Enter_L1 DLLPs sent continuously 
• Upstream component blocks scheduling of new TLPs 
• Upstream component receives acknowledgment for last TLP 
• Upstream component sends PM_Request_Ack DLLP continuously until it sees electrical idle 
• Downstream component sees PM_Request_Ack DLLP, disables DLLP, TLP transmission and 
brings Phy Layer to electrical idle 
• Upstream component completes L1 transition, disables DLLP, TLP transmission and 
brings Phy Layer to electrical idle] 

Figure 2.4-9: Entry into L1 Link State 



2.4.5.1.3 Active State Power Management (ASPM) 

Active State Power Management (ASPM) is an autonomous hardware based active state 
mechanism that enables power saving even when the connected components are in the 
D0 state (fully operational state). After a period of idle Link time, the ASPM mechanism 
engages in a Physical Layer protocol that places the idle Link into a lower power state. 
Once in the lower power state, transitions to the fully operative L0 state are triggered by 
traffic appearing on either side of the Link. This feature may be disabled by software. 

Since ASPM is initiated by the link being idle for a specified amount of time, the 
physical layer can be placed in a lower power state regardless of whether the component 
is shareable or not. When any traffic (regardless of OS domain) appears, the link is 
placed in the fully operative L0 state. 
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2.4.5.1.4 Power Management Event Mechanisms 

2.4.5.1.5 Link Wakeup 

PCI Express components are permitted to wake up the system using a wakeup mechanism
followed by a power management event (PME) Message. These PME Messages are in-band
TLPs routed from the requesting device towards the Root Complex. This PME
mechanism is broken up into two tasks:

• Reactivation (wakeup) of the associated resources 

• Sending a PME Message to the Root Complex 

The Link wakeup mechanisms provide a means of signaling the platform to re-establish
power and reference clocks to the components within its domain. There are two defined
wakeup mechanisms: Beacon and WAKE#. The Beacon uses in-band signaling to
implement wakeup functionality. WAKE# is an input to the Switch, and in response to
WAKE# being asserted, the Switch must generate a Beacon that is propagated to the
Root Complex.

The Switch must translate the wakeup mechanism appropriately when some ports use the
beacon mechanism and others use WAKE#. The COP will keep a scoreboard of the
downstream ports' wakeup states, and when all the downstream ports of a specific Root
Complex have been woken up, the COP will either send a beacon or WAKE# to the Root
Port.

Regardless of the wakeup mechanism used, once the Link has been re-activated and
trained, the requesting agent then propagates a PM_PME message upstream to the Root
Complex.
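The scoreboard described above can be pictured with a small model. This is only a sketch: the class name `WakeScoreboard` and its fields are hypothetical, and the text does not say how the COP actually stores the per-port wakeup state.

```python
from dataclasses import dataclass, field


@dataclass
class WakeScoreboard:
    """Hypothetical model of the COP scoreboard for one Root Complex."""
    downstream_ports: frozenset            # ports belonging to this Root Complex
    awake: set = field(default_factory=set)

    def port_woke(self, port: int) -> bool:
        """Record a wakeup; return True once every downstream port is awake."""
        if port not in self.downstream_ports:
            raise ValueError(f"port {port} does not belong to this Root Complex")
        self.awake.add(port)
        # Only when this returns True would the COP signal the Root Port
        # (by beacon or WAKE#, whichever that port uses).
        return self.awake == set(self.downstream_ports)


scoreboard = WakeScoreboard(frozenset({2, 3}))
first = scoreboard.port_woke(2)    # one port still asleep
second = scoreboard.port_woke(3)   # all ports awake
```

Keeping one scoreboard per Root Complex mirrors the wording "all the downstream ports of a specific Root Complex".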




2.4.5.1.6 PME Messages

PCI Express devices must be notified before their reference clock and main power is
removed so that they can prepare for it. This is done as follows:

1. Before power and clocks are turned off, the Root Complex (or Downstream Port)
sends a PME_Turn_Off message to all agents downstream to cease initiation of
subsequent PM_PME messages.

2. Each agent is required to respond with a PME_To_Ack TLP, which must
terminate at the point of origin.

3. As each agent responds with a PME_To_Ack TLP, the TLP is received by the
endpoint port and routed to the COP. When the COP receives PME_To_Ack
TLPs for all of a particular Root Complex's downstream ports, the COP
generates and sends a PME_To_Ack TLP to the Root Port.

4. Once an endpoint port has sent the PME_To_Ack packet, it must then prepare for
removal of power and clocks by initiating a transition to the L2/L3 Ready state.

5. The Switch is responsible for making sure that the upstream port goes to L2/L3
Ready state after all its downstream ports have entered L2/L3 Ready state. It
should not wait indefinitely for the PME_To_Ack packet, but should implement
a timeout mechanism where it would assume that the PME_Turn_Off TLP was
received after the timeout expired.

6. The Power Delivery Manager must wait a minimum of 100 ns after the Root
Complex transitioned to L2/L3 Ready before removing power and clocks.
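Steps 3 and 5 amount to "wait for every downstream acknowledgment, but give up after a timeout". A minimal sketch of that decision follows; the function name and the timeout value are assumptions, since the text does not specify how long the Switch waits.

```python
def done_waiting_for_acks(acked_ports, downstream_ports, elapsed_us, timeout_us):
    """Return True when the Switch may stop waiting for PME_To_Ack TLPs.

    Hypothetical helper: either every downstream port of the Root Complex
    has acknowledged, or the (unspecified) timeout has expired and the
    PME_Turn_Off is assumed to have been received.
    """
    all_acked = set(downstream_ports) <= set(acked_ports)
    timed_out = elapsed_us >= timeout_us
    return all_acked or timed_out


# Port 2 has not answered yet and the timer is still running: keep waiting.
still_waiting = not done_waiting_for_acks({1}, {1, 2}, elapsed_us=50, timeout_us=100)
```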



2.4.6 Switch initialization 

There are some basic events that must always happen whenever our device powers up,
from a software/configuration point of view.

1. I2C initialization 

2. Link training 

3. OSD negotiation 

4. System setup update (optional) 

5. FC initialization 

6. Pseudo-Device Discovery (optional) 

7. Device Discovery 



Each will be covered in more detail in the following sections.

2.4.6.1.1 I2C initialization

There will be hardware defaults for many of the registers in our chip that should provide
a functional chip at boot time. There are, however, a few structures that must be
provisioned by the I2C interface, since the hardware has no idea what type of
system is being created.

There will be a default binding of OSDs to port set up by the I2C. This means the I2C
will set up the required tables so that transactions that enter the switch know
where to go for the various OSDs on that port. The following structures must be
provisioned with default values by the I2C for the switch to boot:



• TC/VC mapping table (indexed by osd[3:0]) in the MAC: set to all 0s except for
TC0 which is hardwired to 1 on VC0. Note that there is one of these tables per
port.

• Destination QID RAM in the Address_lookup_module (indexed by
{dest_port[4:0],osd[3:0],tc[2:0]}): All entries for dest_port=16 should be
provisioned to return some 5-bit number as the dest qid for all OSD/TC
combinations the system expects to appear. Again, the valid bit should be set for
these entries. There is only one of these in the switch.



For example, if the EEPROM expects a 2-OSD device to be plugged into port 8, and this
device should talk to ports 0 and 1, the I2C will write the data in the Destination QID
RAM at address {5'd16,4'd0,3'd0} to a value of 0. It will also write the address at
{5'd16,4'd1,3'd0} with a value of 1. By writing these values (and setting the valid bit for
each address), initial transactions will be queued in the switch core and eventually routed
to the COP through the correct queue groups.
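To make the RAM index concrete, the sketch below packs the three fields into one 12-bit address. It assumes the braces are Verilog-style concatenation, so dest_port occupies the most significant bits; the function name and the dictionary standing in for the RAM are illustrative only.

```python
def dest_qid_addr(dest_port: int, osd: int, tc: int) -> int:
    """Pack {dest_port[4:0], osd[3:0], tc[2:0]} into a 12-bit RAM address.

    Assumes Verilog-style concatenation (leftmost field is most significant).
    """
    if not (0 <= dest_port < 32 and 0 <= osd < 16 and 0 <= tc < 8):
        raise ValueError("field out of range")
    return (dest_port << 7) | (osd << 3) | tc


# The two writes from the example, with a dict standing in for the RAM.
ram = {}
ram[dest_qid_addr(16, 0, 0)] = {"valid": True, "dest_qid": 0}
ram[dest_qid_addr(16, 1, 0)] = {"valid": True, "dest_qid": 1}
```

Under that assumption the two example entries land at addresses 0x800 and 0x808.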

2.4.6.1.2 Link training

This protocol should follow the base specification mechanism to allow a given port to
train as a 1x, 2x, 4x, or 8x link. Since our switch is actually composed of 8x and 4x
PCI-Express cores (an 8x core plus a 4x core will be contained in a MAC), the 8x core in each
MAC will attempt to train first. If it trains in anything less than 8x, it'll "turn on" the 4x
core and allow it to attempt to train. If it trains to 8x, the 4x core will never be turned on
since all 8 lanes are in use by the 8x core.




2.4.6.1.3 OSD negotiation

In order to minimize the required console management software usage, our device will
support auto-negotiation of the number of OSDs that are present on a given shared I/O
port. The EEPROM will configure the allowable number of OSDs for each shared port
and will be loaded as the default configuration of the system.

Once our switch completes link training when a port is enabled (due to a plug-in card
being added or just coming out of reset in the system), it will begin the process of
figuring out what types of devices it is connected to on each port. A new procedure is
defined to support this, using a new DLLP format shown here:

[Figure: DLLP byte layout with the Type, PH, R, VN and OSD Cnt fields]

Figure 2.4-10: InitOSD DLLP format

Type - Always set to 0000 0001 for an OSD negotiation DLLP

PH - Phase of OSD negotiation (0 for InitOSD1 DLLPs, 1 for InitOSD2 DLLPs)

R - Reserved

VN - Version number (set to 00 for base OSD negotiation)

OSD Cnt - For InitOSD1 DLLPs, the number of OSDs present in the device; for
InitOSD2 DLLPs, the negotiated number of OSDs for the link


The protocol very closely resembles the base specification's method of FC initialization. 
For the "shared I/O base mode" negotiation, the VN field must be set to 00. This means 
that the OSD and VC will explicitly be used for all wireline communication between two 
devices. 



For "shared I/O extended mode", the VN field can be set to 01. This mode
means that the RN field will be used to explicitly map traffic onto buffer resources
between the link partners. This allows a larger number of OSDs to share a buffer if
desired.



1. The OSD state machine will begin transmitting the same packet, an InitOSD1
DLLP, over and over every clock cycle until it receives an InitOSD1 DLLP from
the link partner's OSD state machine. At that point, the device will check the
value of the "OSD Cnt" field received from the link partner and compare that to
the switch's number of OSDs (which was being advertised in the "OSD Cnt" field
of the packets it initiates). The lesser of the two numbers will be the negotiated
number of supported OSDs for that link. If the link partner never sends any
InitOSD1 DLLPs after a timer expires (3 us), it is a non-shared device and our switch
will proceed to normal flow control initialization.

2. At that point, the OSD state machine will continue sending continuous InitOSD1
DLLPs, but now it'll put the newly negotiated value in the "OSD Cnt" field.
Once it receives a DLLP from the link partner with the same number in the "OSD
Cnt" field, the state machine moves on to step 3.

3. The OSD state machine now transmits the same final value, except it sends it as an
InitOSD2 DLLP by setting the PH bit in the DLLP. The fact that the state
machine is now sending this type of DLLP means that it understood the OSD
negotiation procedure from its link partner. A timer is also started at this point,
and if the timer expires (3 us) before an InitOSD2 is received from the link
partner, the state machine is reset and the process begins again. If the state
machine receives an InitOSD2 from its link partner, OSD negotiation is complete
and it stops transmitting InitOSD2 DLLPs.
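The outcome of step 1 reduces to a minimum over the two advertised counts, with a timeout case for non-shared partners. The sketch below models only that outcome, not the DLLP exchange itself; the function name and the use of `None` for "no InitOSD1 received before the 3 us timer expired" are assumptions made for illustration.

```python
def negotiate_osd_count(local_osd_cnt, partner_osd_cnt):
    """Return the negotiated OSD count for a link, or None for a non-shared partner.

    partner_osd_cnt is None when no InitOSD1 DLLP arrived before the
    3 us timer expired (a modelling convention, not part of the text).
    """
    if partner_osd_cnt is None:
        return None  # non-shared device: fall back to normal FC initialization
    return min(local_osd_cnt, partner_osd_cnt)


# A 16-OSD switch port facing a 4-OSD controller settles on the smaller count.
negotiated = negotiate_osd_count(16, 4)
```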



2.4.6.1.4 Shared resource initialization

Once the number of OSDs is known on a link, shared resource initialization begins if the
VN field was set to 01 during OSD negotiation. If the VN field was 00, this step is
skipped.

This mechanism allows the two devices to map multiple OSD/VCs onto a common
resource, or buffer, if desired. The results of the shared resource initialization will be
stored in a register for software to read during system setup update. If any remapping of
OSD/VCs to buffers is required, it can be done at that point.

This process performs the same basic steps as OSD negotiation. If a link partner does
not respond with InitRN1 DLLPs within a 3 us time interval, it does not support shared
resource initialization.



[Figure: InitRN DLLP byte layout, bits 7..0 per byte]

Type - always set to 0000 0010 for a RN initialization DLLP

PH - Phase of RN initialization (0 for InitRN1 DLLPs, 1 for InitRN2 DLLPs)

R - Reserved

RN Cnt - Total number of shared buffers that can be used by the link partner

The step by step breakdown of the shared resource initialization protocol is shown below. 



1. The RN state machine will begin transmitting the same packet, an InitRN1 DLLP,
over and over every clock cycle until it receives an InitRN1 DLLP from the link
partner's RN state machine.

2. The RN state machine now sets the PH bit so that the DLLPs are now InitRN2
type. The fact that the RN state machine is now sending this type of DLLP means
that it understood the RN initialization procedure from its link partner. A timer is
also started at this point, and if the timer expires (3 us) before an InitRN2 is
received from the link partner, the state machine is reset and the process begins
again. If the state machine receives an InitRN2 from its link partner, shared
resource initialization is complete and it stops transmitting InitRN2 DLLPs.



If this step is skipped because a link partner does not support it, the routing header will
always have the RNP bit set to 0 since the RN field will not be used.



2.4.6.1.5 System setup update 

At this point, the number of OSDs (and the number of buffer resources) available
in the link partner is known by the hardware. These values are written to a register so that
software can see the results.



Based on the OSD negotiation that enabled the allowable OSDs on that port, the
switch will write to the 16 buffer resource registers based on the results of OSD
negotiation. These registers will contain some fields that specify what the encodings are
(internal to the switch) to map to a particular OSD/VC. The EEPROM will have
already set up the amounts for both the header and data buffers.

For example, if two OSDs were negotiated, the 16 registers might look something like
this:

Buffer resource 0 Register: OSD=0, VC=0, valid=1
Buffer resource 1 Register: OSD=1, VC=0, valid=1
Buffer resource 2 Register: OSD=x, VC=x, valid=0




This means that ingressing TLPs on OSD0/VC0 will use buffer resource 0 space since the
link partner will set the RN=0 when it sets the OSD=0 in the AS header. Incoming TLPs
on OSD1/VC0 will have RN=1, OSD=1 in the ExAS header.
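The register example can be modelled as a small lookup from an OSD/VC pair to a buffer resource number (RN). This is a sketch under two assumptions that go beyond the text: only VC0 is enabled, and OSDs are assigned to registers in ascending order. The field names are illustrative.

```python
def build_buffer_resource_regs(negotiated_osds, num_regs=16):
    """Fill the 16 buffer resource registers after OSD negotiation.

    Assumes VC0 only and one register per negotiated OSD, in order.
    """
    regs = []
    for rn in range(num_regs):
        if rn < negotiated_osds:
            regs.append({"osd": rn, "vc": 0, "valid": True})
        else:
            regs.append({"osd": None, "vc": None, "valid": False})
    return regs


def lookup_rn(regs, osd, vc):
    """Return the buffer resource number for an OSD/VC, or None if unmapped."""
    for rn, reg in enumerate(regs):
        if reg["valid"] and reg["osd"] == osd and reg["vc"] == vc:
            return rn
    return None


regs = build_buffer_resource_regs(2)
rn_for_osd1 = lookup_rn(regs, osd=1, vc=0)
```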

The hardware will now pause and query a control bit, halt_on_osd_complete, to
determine what to do next. If that control bit is set (by the EEPROM during system
boot), the hardware will wait for software to clear it before proceeding to FC initialization.
If the bit is cleared (i.e., wasn't set by the EEPROM programming), the hardware will
proceed to FC initialization immediately.

The purpose of this bit is to allow the system software/console management software an 
opportunity to take a look at what the results were of OSD negotiation. It can then go re- 
provision the buffer resources prior to FC initialization by reallocating space to different 
OSDs if necessary (and possibly VCs; this particular topic is covered in a sub-section, 
below). Note that the hardware makes no assumptions on the "correctness" of this 
reprogramming and will make no attempt to recover from invalid programming at this 
point. 

2.4.6.1.6 Flow Control (FC) initialization

If only 1 OSD is present as a result of OSD negotiation or OSD negotiation was skipped
because the link partner is a non-shared device, the state machine will use the normal
PCI-Express base specification FC initialization procedure. However, if more than one
OSD is present, the state machine will begin "shared I/O base FC initialization". If the
shared resource initialization step was successful, the state machine will begin "shared
I/O extended FC initialization." There is also another mode called Buffer Retry
Mode that is not implemented and is explained in section 2.4.6.1.9.
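The choice between the three procedures can be written as a short decision function. This is a sketch only: the mode names are labels invented here, and the precedence (single-OSD check first, then the shared resource result) is an assumption read from the order of the sentences above.

```python
def select_fc_init_mode(negotiated_osds, shared_resource_init_ok):
    """Pick the FC initialization procedure for a link.

    negotiated_osds is None when OSD negotiation was skipped because the
    link partner is a non-shared device (a modelling convention).
    """
    if negotiated_osds is None or negotiated_osds <= 1:
        return "base_spec"            # normal PCI-Express FC initialization
    if shared_resource_init_ok:
        return "shared_io_extended"   # credits advertised per RN
    return "shared_io_base"           # credits advertised per OSD/VC


mode = select_fc_init_mode(negotiated_osds=2, shared_resource_init_ok=False)
```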




2.4.6.1.7 Shared I/O Base FC Init



A new DLLP is created to include the initialization information. This DLLP is shown
here:

Figure 2.4-11: InitFC-H/InitFC-D DLLP format



Type - Upper nibble set to 0111 for an FC initialization shared-link DLLP. The lower
nibble specifies the VC number.

PH - Phase of FC Initialization (0 for InitFC1 DLLPs, 1 for InitFC2 DLLPs)

TT - Transaction type (00 for Posted, 01 for Non-posted, 10 for Completions)

R - Reserved

OSD - OS Domain, ranging from 0 to 63 to specify the unique OSDs on the link

CT - Credit type (0 for header credits, 1 for data credits). This basically identifies the
DLLP as either an InitFC1_H or an InitFC1_D.

Credit count - contains either the 12-bit data credit count or the 8-bit (upper 4 bits are
zeros) header credit count, based on the value of the CT bit
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This section shows how shared-port FC initialization (where shared resource initialization 
was skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link that negotiated 2 OSDs. For this example, our switch was configured
to only enable VC0 on each OSD:

InitFC1_H (OSD = 0, VC = 0, Posted, header)
InitFC1_D (OSD = 0, VC = 0, Posted, data)
InitFC1_H (OSD = 0, VC = 0, Non-posted, header)
InitFC1_D (OSD = 0, VC = 0, Non-posted, data)
InitFC1_H (OSD = 0, VC = 0, Completion, header)
InitFC1_D (OSD = 0, VC = 0, Completion, data)
InitFC1_H (OSD = 1, VC = 0, Posted, header)
InitFC1_D (OSD = 1, VC = 0, Posted, data)
InitFC1_H (OSD = 1, VC = 0, Non-posted, header)
InitFC1_D (OSD = 1, VC = 0, Non-posted, data)
InitFC1_H (OSD = 1, VC = 0, Completion, header)
InitFC1_D (OSD = 1, VC = 0, Completion, data)




So for this example, 12 unique DLLPs will communicate the credits for each OSD/VC
enabled. Anytime more VCs are enabled using the normal mechanism for PCI-Express,
this procedure is run in the same way.

Since VC0 should always be enabled, the switch should "expect" to receive the same 12
DLLPs from the link partner.

This pattern will repeat until the corresponding DLLPs have been received from the
link partner. At that point, the switch will begin sending the same sequence of DLLPs,
except they'll be InitFC2_H and InitFC2_D DLLPs (again, all 12 in a repeating
sequence). Once the switch receives an InitFC2 DLLP from its link partner, FC
initialization is complete. Note that whenever the FC1 phase is complete, TLPs can
begin transmitting on the link. FC2 is just used to finally complete the handshake.

The EEPROM will contain the pre-calculated amount of credits to advertise and load
these values into our internal registers at boot time. This can result in non-optimal buffer
usage in the event the default provisioning is set up for a device that does not contain the
same number of OSDs and/or VCs. The actual hardware defaults will assume all 16
OSDs are enabled, all with VC0. As such, the credits will be equally split across all 16
OSDs for all transaction types.
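The repeating InitFC1 sequence follows a simple nesting: OSD, then transaction type, then header before data. The generator below is a sketch of that ordering; the tuple layout is illustrative, and placing the VC loop between OSD and transaction type is an assumption, since the text only shows the single-VC case.

```python
from itertools import product

TRANSACTION_TYPES = ("Posted", "Non-posted", "Completion")
CREDIT_TYPES = (("H", "header"), ("D", "data"))


def initfc1_sequence(num_osds, vcs=(0,)):
    """List the InitFC1 DLLPs sent in one pass of the repeating sequence."""
    seq = []
    for osd, vc, tt in product(range(num_osds), vcs, TRANSACTION_TYPES):
        for suffix, kind in CREDIT_TYPES:
            # One header DLLP and one data DLLP per OSD/VC/transaction type.
            seq.append((f"InitFC1_{suffix}", osd, vc, tt, kind))
    return seq


seq = initfc1_sequence(2)   # 2 OSDs x 1 VC x 3 types x 2 credit kinds
```

For two OSDs with only VC0 enabled this yields the 12 DLLPs listed in the example.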



2.4.6.1.8 Shared I/O extended FC Init
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This section shows how shared-port FC initialization (where shared resource initialization 
was not skipped) is performed. The pattern will closely resemble the base specification
method of advertising normal FC. The state machine begins sending InitFC1_H and
InitFC1_D DLLPs. It will send them in a repeating sequence as shown below as an
example of a link where we have 2 shared resources (RN = 2) as provisioned by the
EEPROM:

InitFC1_H (RN = 0, Posted, header)
InitFC1_D (RN = 0, Posted, data)
InitFC1_H (RN = 0, Non-posted, header)
InitFC1_D (RN = 0, Non-posted, data)
InitFC1_H (RN = 0, Completion, header)
InitFC1_D (RN = 0, Completion, data)
InitFC1_H (RN = 1, Posted, header)
InitFC1_D (RN = 1, Posted, data)
InitFC1_H (RN = 1, Non-posted, header)
InitFC1_D (RN = 1, Non-posted, data)
InitFC1_H (RN = 1, Completion, header)
InitFC1_D (RN = 1, Completion, data)

2.4.6.1.9 Buffer Retry Mode

In Buffer Retry Mode, FC Initialization will be performed according to the way it is
specified in the PCI Express Base Specification. If the port is shared, it will be
transparent at this stage - FC Initialization is performed on a per VC basis. The
partitioning of the buffer resources per OSD within each VC is determined during the
configuration of the device-specific registers.

In Buffer Retry Mode, the memory is partitioned by VC as well as by OSD, and
sets aside a "surplus" amount of memory that all types of packets can access regardless of
OSD. The partitioning of the memory for Buffer Retry Mode is shown in Figure 12 and
the variables that are programmable are shown in Table 4.

[Figure: segments of the data buffer]

P1_TOTAL_MEM
P2_TOTAL_MEM

Retry Buffer Mode**:
P1_OSD(0)_RSVD_MEM
P1_OSD(1)_RSVD_MEM
...
P1_OSD(m-2)_RSVD_MEM
P1_OSD(m-1)_RSVD_MEM
P1_VC(1)_SURPLUS_MEM

Where:
n = number of VCs
m = number of OSDs

* In default mode, there are no further partitions beyond the segments shown above.
** In retry mode, the memory is partitioned as shown above.
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Figure 12: Breakdown of the Data Buffer 



Since Flow Control DLLPs report credits solely by VC, and the Buffer Retry Mode 
breaks the buffer down even further into OSDs per VC, a packet could be transmitted that 
had enough flow control credit, but would still not get stored in the data buffer. This 
would happen if the packet belonged to an OSD that no longer had free space in its 
section of the data buffer, even though there appears to be credit according to flow 
control. 



Instead of dropping packets that could not get stored in the data buffer due to lack of
resources for a particular OSD, the Ack/Nak DLLPs are used to inform the other end of
the link to retry the packet. According to the PCI Express Base Specification, an
Ack/Nak DLLP is generated in the receive side of the Data Link Layer and is transmitted
to the other side of the link to inform the transmit side of the Data Link Layer to either
free its buffer resources that correspond to the packet (if Ack'd because there were no
data integrity errors), or retransmit the retry buffer (if Nak'd because there were data
integrity errors). If repeated attempts to transmit a TLP are unsuccessful, the transmitter
will instruct the Physical Layer to retrain the link.

The VMAC takes the Ack/Nak DLLPs to another level by allowing the Rx Transaction
Layer Module to alter the DLLP depending on whether or not a TLP is able to get stored in
the Data Buffer. If the TLP cannot be stored, the Ack DLLP that corresponds to the TLP
is changed to a Nak DLLP and is transmitted to the other side of the link. One of the
reserved bits in the Nak DLLP will then be set to differentiate between a Nak caused by a
data integrity error (which could ultimately retrain the link) and a Nak that is caused by a
buffer retry condition. This bit is shown in Figure 13 (Note: Using reserved bits
is acceptable since nonzero values in reserved fields are ignored and will not cause
errors with other PCI Express devices).

If the TLP has no resources in the data buffer, but is stored in the surplus segment,
one of the reserved bits in the Ack DLLP will be set to notify the other side that the OSD
corresponding to that particular TLP is reaching buffer saturation. This bit will be used
by the Transaction Arbiter in the Tx Presentation Module. The Transaction Arbiter will
skip the Transaction Queue that corresponds to the particular VC/OSD group on the
proximate turn. This will give the Rx data buffer some time to free up its resources.


Programmable Variable    Mode                     Width    Description

P1_TOTAL_MEM             Default & Buffer Retry            00 = 32KB total memory allocated to Port 1
                                                           01 = 64KB total memory allocated to Port 1
                                                           10 = 96KB total memory allocated to Port 1
                                                           11 = 128KB total memory allocated to Port 1

P1_VC(n)_MEM             Default & Buffer Retry   6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                           where [...]

P1_VC(n)_SURPLUS_MEM     Buffer Retry Only        3 bits   0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB
                                                           where P1_VC(n)_SURPLUS_MEM < P1_VC(n)_MEM

P1_OSD(m)_RSVD_MEM       Buffer Retry Only        6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                           where [...]

P2_TOTAL_MEM             Default & Buffer Retry   N/A      P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM

P2_VC(n)_MEM             Default & Buffer Retry   6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB

P2_VC(n)_SURPLUS_MEM     Buffer Retry Only        3 bits   0x0 = 16KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x7 = 14KB
                                                           where P2_VC(n)_SURPLUS_MEM < P2_VC(n)_MEM

P2_OSD(m)_RSVD_MEM       Buffer Retry Only        6 bits   0x0 = 128KB, 0x1 = 2KB, 0x2 = 4KB, ..., 0x3F = 126KB
                                                           where [...]

Table 4: Programmable Variables for Data Buffer
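The field encodings in Table 4 share one pattern: each step is 2KB, and the all-zeros code stands for the maximum. The helper functions below are a sketch with hypothetical names that decodes the 3-bit and 6-bit fields and the two-bit port split.

```python
def decode_mem_field_kb(code, width_bits):
    """Decode a 2KB-granular size field; code 0 means the maximum size."""
    if not 0 <= code < (1 << width_bits):
        raise ValueError("code does not fit in the field")
    if code == 0:
        return (1 << width_bits) * 2   # 6 bits -> 128KB, 3 bits -> 16KB
    return code * 2


def p1_total_kb(code):
    """Decode P1_TOTAL_MEM: 00 = 32KB, 01 = 64KB, 10 = 96KB, 11 = 128KB."""
    if not 0 <= code <= 3:
        raise ValueError("P1_TOTAL_MEM is a two-bit code")
    return (code + 1) * 32


def p2_total_kb(p1_code):
    """P2_TOTAL_MEM = 128KB - P1_TOTAL_MEM."""
    return 128 - p1_total_kb(p1_code)


vc_mem = decode_mem_field_kb(0x3F, 6)   # largest non-zero 6-bit code
```

One consequence of the table as written is that giving Port 1 the full 128KB leaves nothing for Port 2.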






[Figure: DLLP layout, bytes +0 to +3, bits 7..0 per byte. Byte 0 holds the DLLP
type (0000 0000 - Ack, 0001 0000 - Nak), followed by the Reserved field,
AckNak_Seq_Num, and the 16b CRC.]

Figure 13: DLLP Format for Ack/Nak Packets

2.4.6.1.10 Pseudo-device discovery

Once the FC init is done, the devices are ready to exchange TLPs, so device discovery
can proceed. This is another optional checkpoint at which we could implement another
halt bit, halt_after_fc_init. If the I2C set this bit at boot time, the hardware will now
pause again. Console management software can now come in using the I2C bus and use
the COP to emulate device discovery by acting like a Root Complex. It could send Type0
and Type1 config cycles to the newly-plugged-in I/O device and figure out what VCs,
buffers, and OSDs that device had available. It would load this information into local
memory and essentially restart the whole process again for that I/O device.
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This is an advanced feature that allows software the hooks it would need to set up optimal 
resources across VCs and buffers before real device discovery happens from the OS. So 
the steps would be as follows: 

1. I2C initialization 

2. Link training 

3. OSD negotiation 

4. FC initialization 

5. "Pseudo" Device Discovery 

6. Load results into local memory of console processor 

7. Link training 

8. OSD negotiation 

9. System setup update (use newly calculated optimal buffer 

10. FC initialization

11. OS Device Discovery



2.4.6.1.11 Discovery

The discovery mechanism for shared devices uses the standard PCI-Express mechanism
for discovery per OS domain. This mechanism uses Type0 and Type1 configuration
cycles to determine which devices, if any, are present behind a link. Each one of these
cycles is expected within the switch for each OS domain.

A root complex will begin sending Type0 CFG cycles to its southbound PCI-Express
port. It will discover the switch port to which it is directly connected. It will discover
the switch as a PCI-bridge, and no other devices on that link. It will then initiate Type1
CFG cycles to discover the devices behind that PCI Bridge on that port. Once inside the
switch, the Type1 CFG cycles will discover all ports and PCI Bridges that have been
assigned to that OS domain. Any shared port will appear as assigned to that OS domain.

When a root complex discovers a shared port, it has no knowledge that the port is shared.
As such, the port will appear to that root complex as a PCI-Bridge, the same as all other
ports. In the initial switch implementation, there will be 16 OS domains possible. Each
root complex will be mapped to one of those OS domains. This mapping will have a
specific AS turn pool encoding as a 16 port switch. From the switch view, it will always
have a 16 port switch as its link partner.




For CFG cycles sent to a shared link, the CFG cycle will be encapsulated within the AS 
PEI8. This will be sent with the turn pool encoding assigned to the appropriate OS 
domain. The response to that CFG cycle will depend on the number of real OS domains 
supported by that link partner. For a given shared link partner, it will support a certain 
number of OS domains. The shared link partner will only respond to CFG cycles which 
are mapped to OS domains that it supports. 

In the following diagram, there is a 4 OS domain shared controller tied to the switch. 
The switch supports 16 OS domains, enumerated as a 16 port virtual AS switch. The 
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Shared controller supports 4 OS domains, enumerated as a 4 port virtual AS switch. The 
switch always sends packets encapsulated to a 16 port virtual AS switch. The controller 
always sends packets encapsulated to a 4 port virtual AS switch. If a port ever sees
packets encapsulated with turn pool beyond the range it supports, the packets are not
responded to at the transaction layer.

This method allows standard PCI discovery to work. In the example below, each OS
domain maps to a given virtual AS switch port. Root Complex 1 initiates a Type1 CFG
cycle destined for the shared controller. The switch changes the Type1 to a Type0 cycle
at Port 10 prior to sending the CFG cycle to the controller. This CFG cycle is
encapsulated to a single virtual AS port, and the controller responds by sending the
response on the corresponding return virtual AS port.

If a Root complex was to exist that was mapped to a virtual AS port that the controller
does not have, e.g. virtual AS port 10, the controller would drop the packet. At the
transport level, this would be the equivalent of a timeout of the CFG read, and the Root
complex would assume no I/O device is present on that PCI bus. This mechanism
allows all devices to be discovered in the same method, with no device required to
know the full fabric topology.




[Figure labels: Root Complex 2 (shares 10G NIC); Port 5; 4 OS Domain Sharable 10G NIC]

Figure 2.4-14: Example of a Shared Switch
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2.5 Error Handling 

Two aspects of a PCI Express switch make its error handling behavior much different 
than that of a PCI Bridge: 

• First, transient link errors can typically be corrected automatically by hardware in
the Data Link Layer. This eliminates the need to report these errors by setting any
bits in the PCI status registers and eliminates any problems produced when the
failed packet was being forwarded on behalf of another device. This is because
the packet will ultimately be transmitted correctly, so no error handling
procedures beyond the scope of a single link need to be considered. Repeated
failures will ultimately cause a fatal error to be recorded and the faulty link will
be shut down.

• Second, non-posted transactions are managed with a completion message that
eliminates the need for a forwarding agent to keep track of expected response
messages. So a switch can blindly forward transactions without determining the
message type or worrying about mapping errors back to the originating master.

When a Data Link Layer or Physical Layer error occurs which cannot be associated with a
particular packet or OSD, the error is logged separately for each OSD and error messages,
if any, are sent to each port sharing the port.

The Nexis Switch implements the PCI Express Advanced Error Reporting Capability
registers in addition to the PCI Express basic error reporting registers and the legacy PCI
mapped error registers. The Advanced Error Reporting Capability registers provide
detailed error logging, error masking and error severity control registers. They also provide
a header logging register for the first unmasked error that's logged. Refer to the
Configuration Registers section for a detailed description of the Advanced Error Reporting
Capability registers.



2.5.1 Error Types

PCI Express errors are classified as Correctable and Uncorrectable errors. Uncorrectable
errors are further classified as Fatal or Non-Fatal errors. Errors are also classified based
on the source of the error as Transaction Layer Errors, Data Link Layer Errors and
Physical Layer Errors. The following sections describe each of the errors and how they
are handled in the Nexis Switch.

2.5.1.1 Correctable Errors

Correctable errors are those which are localized to a single PCI Express link and can be
automatically corrected by hardware. All correctable errors are automatically corrected
by a retransmission of the faulty packet. An ERR_COR message reporting the occurrence
of the error may optionally be sent to the root complex. The message is sent only if the
error is not masked and the SERR Enable bit is set in the Command Register and Bridge
Control register.
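The condition for sending an ERR_COR message is a plain conjunction of three settings. The sketch below only restates that rule; the function and argument names are illustrative and do not correspond to actual register field names.

```python
def should_send_err_cor(error_masked, serr_en_command, serr_en_bridge_control):
    """Return True when an ERR_COR message should go to the root complex.

    The error must be unmasked and SERR Enable must be set in both the
    Command Register and the Bridge Control register.
    """
    return (not error_masked) and serr_en_command and serr_en_bridge_control


# Unmasked error, but SERR Enable is clear in Bridge Control: no message.
send = should_send_err_cor(False, True, False)
```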

2.5.1.1.1 Physical Layer Errors
• Receiver Error 
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Physical Layer receivers may optionally check for errors and report them by sending 
an ERR_COR message to the root complex. Any DLLP or TLP being received that is 
in error should be discarded and any storage allocated made available. The error will 
be automatically corrected when a NAK DLLP is received and the packet is re- 
transmitted. 

The Physical Layer will check for disparity errors and invalid symbols. If any of
these errors occur, the Physical Layer will report it. If the error (either one) is part of
a valid TLP, the Rx Link Layer will send a NAK DLLP for the corresponding TLP
and will not report an error since it was already done by the Physical Layer. If the
error is detected in the Data Link Layer in time (within 3 clock cycles from the start
of packet), the packet will be purged and the Rx Transaction Layer will never see the
packet. If the error is not detected in time, the TLP will be forwarded to the
Transaction Layer with an error indication at the end of the packet.

2.5.1.1.2 Data Link Layer Errors

• Bad TLP 

This error is set when the link layer detects a 

o BadCRC 

o Incorrectly nullified packet (TLPl 
inverted) 

o Incorrect packet sequence ^ 

• Bad DLLP 

This error occurs when a CRQd^ck' 

• Replay Timer Timeoufe^X ^ 
This error occurs when thCjRE^LA _^ 
occurs when no ACK^^^^r T%^K ^ 




following. 



utthe LCRCis not 





has been exceeded by a given TLP, which 
received within that time period. This error is 



automatically corrected by.N^^ii)g the TLP and forcing a re-transmission. 



• REPLAY NUMJillla^i;. 



This error c^cc^^^^n^given TLP was unsuccessfully retransmitted REPLAY_NUM 
times. Thi^onditiiSkis ^'^tomatically corrected by signaling the Physical Layer to 
retrap^We^lird^ Onciretraining is successful, the TLP can again be retransmitted (and 
REII.AY_^NIMli&feset). 
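The interaction between replays and REPLAY_NUM Rollover can be sketched as a small state model. This is only an illustration under stated assumptions: the class name `ReplayState`, the `replay_num_limit` parameter, and the returned action strings are hypothetical and are not taken from the Nexsis design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplayState:
    """Hypothetical model of the replay bookkeeping for one TX link."""
    replay_num_limit: int = 4          # assumed limit; the text only names REPLAY_NUM
    replay_num: int = 0                # retransmissions attempted so far
    reported: List[str] = field(default_factory=list)

    def on_replay_timeout(self) -> str:
        """No ACK or NAK arrived in time: flag the correctable error, then replay."""
        self.reported.append("Replay Timer Timeout")
        return self._replay()

    def on_nak(self) -> str:
        """A NAK DLLP arrived: replay the outstanding TLP."""
        return self._replay()

    def _replay(self) -> str:
        if self.replay_num >= self.replay_num_limit:
            # The TLP has already been retransmitted REPLAY_NUM times without success.
            self.reported.append("REPLAY_NUM Rollover")
            self.replay_num = 0        # reset once the link has been retrained
            return "retrain_link"
        self.replay_num += 1
        return "retransmit"
```

With a limit of 2, two NAKs produce `"retransmit"` and a third produces `"retrain_link"`, after which the counter starts again from zero.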

2.5.1.2 Uncorrectable Errors

Uncorrectable Errors are those which disrupt the functionality of the PCI Express port but
cannot be corrected by hardware. Using the Advanced Error Reporting Capability
Registers, each uncorrectable error can be configured to be sent with an ERR_FATAL or
ERR_NONFATAL message to the root complex or can be masked off from sending a
message. The error messages are sent only if the SERR Enable bit is set in both the
Command Register and Bridge Control Register.



2.5.1.2.1 Physical Layer Errors

• Training Error

This error occurs when a device fails to establish a link with its partner. An error 
message is sent to the root complex if enabled, and the link is taken down. 

2.5.1.2.2 Data Link Layer Errors 

• Data Link Layer Protocol Error 

This error is caused when the TX MAC receives an ACK/NAK DLLP with a 
sequence number that does not correspond to any of the packets in the retry buffer. 

2.5.1.2.3 Transaction Layer Errors 

• Unsupported Request 

An Unsupported Request Error is generated for the following conditions.

o A request cannot be mapped to any address space mapped to the device or
to any egress ports.

o Downstream port of the switch receives a configuration request with Device
number 1-31. The port will terminate the transaction and not issue it on the
link.

o A packet is forwarded to an egress port to be transmitted across the link, but
the link is in DL_Down state.

• ECRC Error

Logging this error is optional. The switch will not verify the ECRC for forwarded
packets. All transactions originating from the switch (COP) will not contain ECRC. All
transactions destined to the switch (Type1/Type0 headers and Device specific registers)
that have ECRC will not be checked. The ECRC Generation Capable and ECRC Check
Capable bits in the Advanced Error Capability register are hardwired to zero.




• Malformed TLP

The receiver of TLP packets generates this event when an inconsistency in the
formation of a packet is detected at the receiver (destination). There are several
conditions that require detection and reporting, some others are optional. The table
below shows the conditions that are supported and not supported by the switch.

Malformed Packet Errors | Supported?
------------------------|-----------
Data payload exceeds max payload size | Yes
Actual data length does not match data length specified in the header | Yes
Start memory DW address crossing a 4KB boundary | No
TD field = 1 but no ECRC | No
Byte enable violation detected | Yes, only for writes to COP
Packets with undefined type field | Yes
Multiple completions with read data that violate RCB | No
Completions with a configuration request retry status in response to a request other than a configuration request | No (is this optional?)
TC contains a value not assigned to an enabled VC within the TC/VC mapping for the receiving device |
Transaction type requiring use of TC0 has TC value other than zero | No (maybe V2)
Routing is incorrect for transaction type (e.g. transactions requiring routing to RC detected moving away from RC) | No (maybe V2; this is an ALM check)
Msg/MsgD messages with 000b routing received at upstream port | ?? (ALM check)
Msg/MsgD messages with 011b routing received at downstream port | ?? (ALM check)

A malformed TLP will be discarded and an ERR_NONFATAL or ERR_FATAL
message may be sent to the root complex. No Nak DLLP is sent in response to a
Malformed TLP, and the flow control credits are not updated. A Completion Response
will not be issued for non-posted transactions with a malformed TLP.
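The three checks marked as supported for all traffic (payload size, length mismatch, undefined type) can be sketched as follows. This is a simplified illustration, not the switch logic: the function name and parameters are hypothetical, the header length is treated as a literal DW count, and type decoding is reduced to a boolean supplied by the caller.

```python
from typing import List

DW_BYTES = 4  # one DWORD is four bytes


def malformed_reasons(length_dw: int, payload: bytes, max_payload_bytes: int,
                      type_is_defined: bool = True) -> List[str]:
    """Return the supported malformed-TLP conditions that a packet violates.

    length_dw         -- data length from the header, taken as a plain DW count
    payload           -- the data actually received
    max_payload_bytes -- configured maximum payload size for the port
    type_is_defined   -- False when the type field does not decode to a known type
    """
    reasons = []
    if len(payload) > max_payload_bytes:
        reasons.append("data payload exceeds max payload size")
    if len(payload) != length_dw * DW_BYTES:
        reasons.append("actual data length does not match header length")
    if not type_is_defined:
        reasons.append("undefined type field")
    return reasons
```

An empty list means none of these conditions fired; any non-empty result would lead to the packet being discarded without a Nak and without a flow control update, as described above.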




• Receiver Overflow

A Receiver may optionally check for Receiver Overflow errors (TLPs exceeding
CREDITS_ALLOCATED). If this condition is detected, TLP(s) are discarded
without modifying the CREDITS_RECEIVED and any resources that had been
allocated for the TLP(s) are de-allocated. (Not supported right now in the FPGA)

• Completion Timeout

When a non-posted transaction fails to return a completion message within the
specified time limit, a completion timeout error has occurred. The Nexsis
switch will not master any non-posted transactions, and so will never generate a
Completion Timeout error for any packets going off-chip.
• Completer Abort

This error occurs when the Completer of a request is unable to process the request 
due to a component-specific error condition. There are no known conditions for 
which the switch will generate this error. 

Unexpected Completion 

This error occurs when a completion message is received that cannot be matched to 
any outstanding requests. The switch will report this error only for the completion 
targeted to the switch as an endpoint. The ALM detects this condition and notifies the 
MAC to log the error. 

• Flow Control (FC) Protocol Error

The receiver of Flow Control DLLP packets generates this event when a violation of
the flow control protocol is detected at the receiver (destination). There are several
conditions that require detection and reporting, some others are optional. The
following conditions may be checked and violations reported.

• During FC initialization for any Virtual Channel the receiver must advertise credits
equal to or greater than the minimum for that FC type.

• If an infinite credit advertisement has been made during
initialization, then any future update credit value must be set to zero.

• Poisoned TLP

A poisoned TLP is one where the EP bit in the header is set, indicating that the TLP
is known to contain an error. If the packet is already poisoned the switch will not
issue an error message unless it is the final target of the poisoned TLP.

If the error is an uncorrectable error in the internal data buffers of the switch for
one of the transit packets, the switch sets the EP bit in the header, logs the error and
forwards the packet.

When a switch forwards a poisoned TLP, the receiving side must set its Detected
Parity Error bit and the transmitting side must set its Master Data Parity Error bit if
the Parity Error Response bit in the Bridge Control register is set.



2.5.1.3 Cut-Through Error Handling

If the cut_thru enable bit is asserted in the Nexsis Switch, a TLP could be forwarded
from the ingress port to the egress port before it has been completely received on an
ingress port. This complicates the error handling in the conditions where the TLP would
otherwise be discarded. If an error does occur on a cut-through packet after it has begun
transmission out an egress port, then the TLP must be 'nullified' to indicate to the
receiving device that an error has occurred. A TLP is nullified by either using the
inverted value for its LCRC or by signaling the physical layer that it must use an EDB






symbol instead of an END symbol as the final framing symbol. The ingress port returns 
a NAK DLLP to the TLP source and the egress port purges the packet from the Replay 
buffer. When the endpoint finally receives the TLP and detects the EDB symbol and the 
inverted CRC, it purges the packet and does not return a NAK DLLP. 

2.5.2 PCI 2.3 Error Reporting 

All PCI Express ports will update the error status bits in the PCI 2.3 configuration space 
as appropriate in order to maintain compatibility with legacy drivers. Note that these 
status bits are independent of any status maintained in the Advanced Error Reporting 
Status or control registers. In particular, setting or clearing an Advanced Error Reporting 
Status bit should not clear the corresponding bit in the PCI 2.3 configuration registers -
these must be left to be explicitly managed by software.

Each PCI Express port connects implicitly to a PCI-PCI virtual bridge. Each of these
bridges will implement a complete and independent PCI-PCI bridge type configuration
space.

Note that the primary bus of each bridge is the one closest to the Root Complex, and that
this can vary depending on how the Nexsis Switch is used in a system.

The following sections detail how the Nexsis Switch should report errors to remain
compatible with legacy PCI 2.3 software.


2.5.2.1 Primary side of P2P Bridge 
• Detected Parity Error

This error will be set whenever the primary side of the internal P2P bridge receives a
poisoned TLP. In the Nexsis switch the RX ingress MAC (Root Port) would set the
Detected Parity Error bit in the Primary Status Register when receiving a poisoned
TLP.

• Signaled System Error

The TX MAC of the upstream port must set the Signaled System Error if an
uncorrectable error (ERR_FATAL or ERR_NONFATAL) is transmitted.

• Received Master Abort

This error needs to be detected only by a PCI Express device which originally initiates
a transaction and hence is not applicable to the Nexsis switch.

• Received Target Abort

This error needs to be detected only by a PCI Express device which originally initiates
a transaction and hence is not applicable to the Nexsis switch.

• Signaled Target Abort

Hardwired to 0 since we will not abort any transactions as an endpoint.

• Master Data Parity Error

This error is detected when forwarding a Poisoned TLP from the secondary side of
the bridge to the primary side. In the Nexsis switch the TX MAC of the upstream port
must set the Master Data Parity Error bit in the Primary Status register if the Parity
Error Response bit in the Command Register is set.
2.5.2.2 Secondary side of P2P Bridge

• Detected Parity Error

This bit is set when the secondary side of the P2P bridge receives a poisoned TLP. In the
Nexsis switch the Rx MAC of the downstream port must set the Detected Parity Error
bit in the Secondary Status register when it receives a poisoned TLP.

• Received System Error

The Rx MAC of a downstream port must set the Received System Error if an
uncorrectable error message (ERR_FATAL or ERR_NONFATAL) is received.

• Received Master Abort

This error needs to be detected only by a PCI Express device which originally initiates
a transaction and hence is not applicable to the Nexsis switch.

• Received Target Abort

This error needs to be detected only by a PCI Express device which originally initiates
a transaction and hence is not applicable to the Nexsis switch.

• Signaled Target Abort

There are no known conditions for which the switch should Target Abort a
transaction targeted to it.

• Master Data Parity Error

This error is detected when forwarding a poisoned TLP from Primary to Secondary.
In the Nexsis switch the TX downstream MAC would set the Master Data Parity Error
bit in the Secondary Status register when transmitting a poisoned TLP.



2.5.3 Error Reporting

Errors may be reported by either generating an explicit error message, or through the
Completion Status field in a Completion header. A completion response is used for
reporting errors when responding to a non-posted request, while explicit error messages are
used for all other types of messages. Note that a Completion Status may only be used by
the intended target of the originating message. Thus, a non-posted message that is being
routed through the switch will never have a Completion generated by the switch, and so
all errors reported for that message will use explicit messages. The only messages that
will use the Completion response to report errors will be those messages that are targeted
to internal registers of the switch itself.



Note also that only Unsupported Request and Completer Abort errors are reported in a 
completion response. All other errors, even for non-posted requests, will generate an 
explicit error message. 
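The choice between the two reporting paths can be summarized in a few lines. The sketch below is an illustration only; the function name, its parameters, and the returned strings are hypothetical and simply restate the rules in this section.

```python
# Errors that may be carried in the Completion Status field (per this section).
COMPLETION_REPORTABLE = {"Unsupported Request", "Completer Abort"}


def reporting_path(error: str, non_posted: bool, targets_switch: bool) -> str:
    """Pick how an error is reported.

    error          -- error name as used in this document
    non_posted     -- True if the offending request expects a completion
    targets_switch -- True if the request targets internal registers of the switch
    """
    if non_posted and targets_switch and error in COMPLETION_REPORTABLE:
        return "completion_status"
    # Transit traffic, posted requests and all other errors use explicit messages.
    return "explicit_error_message"
```

For example, an Unsupported Request on a configuration read aimed at the switch itself yields `"completion_status"`, while the same error on a request merely routed through the switch yields `"explicit_error_message"`.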

2.5.3.1.1 Completion Status Response

The format of a completion header used to respond to an error condition is as follows:

Bytes 0-3 (+0 to +3):  R | Fmt = 0 1 | Type = 10000 | R | TC | Reserved | TD | EP | Attr = 0 0 | R | Length
Bytes 4-7:             Completer ID | Compl. Status | BCM | Byte Count
Bytes 8-11:            Requester ID | Tag | R | Lower Address

Figure 2.5-1: Format of Completion Header

The following table shows the fields and their values for completion headers reporting 
error conditions: 



Field | bits | Description | Value | Variable
------|------|-------------|-------|---------
Length | 9:0 | Always zero for error completions | 0 |
R | 11:10 | reserved | |
Attr | 13:12 | Copied from request header | |
EP | 14 | Indicates TLP is poisoned | | No
TD | 15 | Indicates presence of TLP digest | | No
Reserved | 19:16 | reserved | | No
TC | 22:20 | Copied from request header | | Yes
R | 23 | reserved | 0 | No
Type | 28:24 | Indicates TLP type | 10000 | No
Fmt | 30:29 | Indicates 4 DW header, no data | 0b01 | No
Byte Count | 43:32 | The remaining byte count for the request | | Yes
BCM | 44 | | 0 | No
Compl. Status | 47:45 | 000 = Successful Completion; 001 = Unsupported Request; 010 = Configuration Request Retry Status; 100 = Completer Abort | | Yes
Completer ID | 63:48 | Bus #, device # and function # of unit generating completion (and reporting error) | | Yes
Lower Address | 70:64 | Unused | 0 | No
R | 71 | reserved | 0 | No
Tag | 79:72 | Copied from request header | | Yes
Requester ID | 95:80 | Copied from request header | | Yes

Table 2.5-1 Completion Header Fields
2.5.3.1.2 Explicit Error Messages

Explicit error messages are generated for every correctable or uncorrectable error which
is not masked in the Advanced Error Capabilities register. For shared ports, an error
message for an error from a port which cannot be associated specifically to an OSD (for
example, Training Error or Replay Timer Timeout) should be sent to all the RCs sharing
the downstream switch port.

Error messages generated are one of the following three types:

Type | Description
-----|------------
ERR_COR | Issued when the component or device detects a correctable error on the PCI Express interface
ERR_NONFATAL | Issued when the component or device detects a Non-fatal, uncorrectable error on the PCI Express interface
ERR_FATAL | Issued when the component or device detects a Fatal, uncorrectable error on the PCI Express interface

Table 2.5-2 PCI Express Error Messages

The format of all error messages is shown in the following table:



Bytes 0-3 (+0 to +3):  R | Fmt = 0 1 | Type = 10000 | R | TC = 000 | Reserved | TD | EP | Attr = 0 0 | R | Length = 0
Bytes 4-7:             Requester ID | Tag = 0 | Message Code

Figure 2.5-2: Format of Error Messages

All error messages are 16 bytes in length, with the following fields and values:




Field | bits | Description | Value | Variable
------|------|-------------|-------|---------
Length | 9:0 | Unused - always zero | 0 | No
R | 11:10 | reserved | 0 | No
Attr | 13:12 | Attributes - always zero | 0b00 | No
EP | 14 | Indicates TLP is poisoned | 0 | No
TD | 15 | Indicates presence of TLP digest | 0 | No
Reserved | 19:16 | reserved | 0 | No
TC | 22:20 | Traffic class, must be zero | 0 | No
R | 23 | reserved | 0 | No
Type | 28:24 | Indicates 'Msg' type | 10000 | No
Fmt | 30:29 | Indicates 4 DW header, no data | 0b01 | No
R | 31 | reserved | 0 | No
Message Code | 39:32 | 0011 0000 = ERR_COR; 0011 0001 = ERR_NONFATAL; 0011 0011 = ERR_FATAL | | Yes
Tag | 47:40 | Unused (no completion required) | 0 |
Requester ID | 63:48 | Bus #, device # and function # of unit reporting error | |

Table 2.5-3 Error Message



The spec requires all error messages to use a 4 DW header. This Error Message field is
taken from the Header Log Register where the header of the first packet in error is stored.

2.5.3.1.3 Message Routing

The three lsb's of the message type indicate the message routing mechanism, which is
always '000' for error messages, indicating the message should be routed to the Root Complex.
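As a worked illustration of the field layout in Table 2.5-3, the sketch below packs an error message into its 16-byte form. It assumes each DW is serialized most-significant byte first (so that R, Fmt and Type land in byte +0, as in Figure 2.5-2) and that the last two DWs are zero; the function name and the bus/device/function arguments are hypothetical.

```python
import struct

# Message codes from Table 2.5-3.
MESSAGE_CODES = {"ERR_COR": 0x30, "ERR_NONFATAL": 0x31, "ERR_FATAL": 0x33}

FMT_4DW_NO_DATA = 0b01       # bits 30:29
TYPE_MSG_TO_RC = 0b10000     # bits 28:24, routing field (three lsb's) = 000


def build_error_message(kind: str, bus: int, device: int, function: int) -> bytes:
    """Pack an ERR_COR / ERR_NONFATAL / ERR_FATAL message header (16 bytes)."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    requester_id = (bus << 8) | (device << 3) | function
    # DW0: every field other than Fmt and Type is zero for error messages.
    dw0 = (FMT_4DW_NO_DATA << 29) | (TYPE_MSG_TO_RC << 24)
    # DW1: Requester ID in bits 63:48, Tag (0) in 47:40, Message Code in 39:32.
    dw1 = (requester_id << 16) | (0 << 8) | MESSAGE_CODES[kind]
    return struct.pack(">IIII", dw0, dw1, 0, 0)
```

For a reporter at bus 2, device 0, function 0, `build_error_message("ERR_FATAL", 2, 0, 0)` begins with the bytes `30 00 00 00 02 00 00 33`, followed by eight zero bytes.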




2.5.4 Header Logging

As part of the Advanced Error Reporting capabilities the Nexsis switch logs the TLP
header for the first uncorrectable transaction layer error reported. Headers are logged
only if the mask bit for the corresponding error is not set in the Uncorrectable Error Mask
register and the Uncorrectable Error Status bit pointed to by the First Error pointer is not
set. Headers are logged in a 4 DWORD register for the following errors.

• Poisoned TLP received

• ECRC Check Failed (error is not supported)

• Unsupported Request

• Completer Abort (error is not supported)

• Unexpected Completion

• Malformed TLP

There are no variations in header logging logic for shared or non-shared ports.
2.5.5 Error Tables 






The following table lists the errors that may occur and be detected on a single PCI 
Express link. 



Layer | Error Name | Default Severity | Detecting Agent Action
------|------------|------------------|-----------------------
P | Receiver Error | Correctable | Send ERR_COR to Root Complex (RC)
P | Training Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
D | Bad TLP | Correctable | Send ERR_COR to RC
D | Bad DLLP | Correctable | Send ERR_COR to RC
D | Replay Timeout | Correctable | Send ERR_COR to RC
D | REPLAY_NUM Rollover | Correctable | Send ERR_COR to RC
D | Data Link Layer Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
T | Poisoned TLP Received | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; log header of TLP
T | ECRC Check Failed | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; log header of TLP
T | Unsupported Request (UR) | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; log header of TLP
T | Completion Timeout | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC
T | Completer Abort | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC
T | Unexpected Completion | Uncorrectable (Non-Fatal) | Send ERR_NONFATAL to RC; log header of Completion
T | Receiver Overflow | Uncorrectable (Fatal) | Send ERR_FATAL to RC
T | Flow Control Protocol Error | Uncorrectable (Fatal) | Send ERR_FATAL to RC
T | Malformed TLP | Uncorrectable (Fatal) | Send ERR_FATAL to RC; log header of TLP

Table 2.5-4: PCI Express Link Errors

Occurrence of any of the above correctable errors can be flagged in the Correctable Error
Status Register and masked in the Correctable Error Mask Register.
Occurrence of any of the above uncorrectable errors can be flagged in the Uncorrectable Error
Status Register and masked in the Uncorrectable Error Mask Register. Additionally, the
uncorrectable errors can be programmed to be reported as either a fatal or non-fatal error by use
of the Uncorrectable Error Severity Register. These registers are replicated for each OSD.
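How the status, mask and severity registers interact can be sketched as below. This is a hedged model, not the register implementation: the class name, the bit positions passed in, and the assumption that the status bit is still recorded when an error is masked are all illustrative; the SERR Enable gate follows the description of uncorrectable errors earlier in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UncorrectableErrorRegs:
    """Hypothetical per-OSD copy of the uncorrectable error registers."""
    status: int = 0     # one bit per error, set when the error is detected
    mask: int = 0       # 1 = do not send a message for this error
    severity: int = 0   # 1 = report as fatal, 0 = report as non-fatal

    def record(self, bit: int, serr_enable: bool = True) -> Optional[str]:
        """Flag an error and return the message to send, or None if suppressed."""
        flag = 1 << bit
        self.status |= flag              # assumed: flagged even when masked
        if (self.mask & flag) or not serr_enable:
            return None
        return "ERR_FATAL" if (self.severity & flag) else "ERR_NONFATAL"
```

Because the registers are replicated for each OSD, a model of a shared port would hold one such object per OSD, for example in a dictionary keyed by OSD number.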

2.5.5.1.1 Error Signaling and Logging

Legend:

Type: C=correctable, NF=non-fatal, F=fatal - indicates type of error message that is generated

S=Supported N=Not Supported

Italicized errors correspond to specific error bit set

2.5.5.1.2 Rx Physical Layer Errors

Error | S/N | Packet Behavior | Reported Error | Type
------|-----|-----------------|----------------|-----
Invalid symbol or disparity error | S | Discard pkt if error is part of pkt. Schedule Nak if TLP | Rx Error. Rx Port Error reported by PLM (does not report if it was part of a pkt or not) | C
Link Error | S | | Receiver Error | C
Training Error | S | N/A | Training Error |

Table 2.5-5: Rx Physical Layer Errors
2.5.5.1.3 Rx Data Link Layer Errors 




Error | S/N | Packet Behavior | Reported Error | Type
------|-----|-----------------|----------------|-----
Invalid sequence number on ACK or NAK DLLP | S | Discard DLLP | DLL Protocol Error | F
Duplicate Seq. Number on TLP | S | Discard TLP, schedule Ack | No error |
Unexpected Seq. Number on TLP | S | Discard TLP, send Nak | Bad TLP | C
LCRC error on TLP | S | Discard TLP unless it's cut-through. Schedule Nak DLLP. | Bad TLP | C
LCRC error on DLLP | | Discard DLLP | Bad DLLP | C
DLLP w/ unsupported type encodings | N | Discard DLLP | No associated Error |
FC Init protocol violations | S/N? | N/A | DLL Protocol Error | F
Rx Framing Violations | S | Discard pkt, send Nak | Bad TLP | C
TLP with EDB and inverted LCRC | | Discard pkt. No Nak scheduled | No associated Error |
Nullified TLP without EDB | S/N? | Discard TLP | Receiver Error | C

Table 2.5-7 Rx Data Link Layer Errors



2.5.5.1.4 Rx Transaction Layer Errors 



Error | Opt | Packet Behavior | Reported Error | Type
------|-----|-----------------|----------------|-----
Unmapped address | S | Discard TLP | Unsupported Request | F
FC overflow | N | Discard pkt. No Nak, No FC update. | Receiver Overflow | F
Malformed TLP (1) | S | Discard TLP, No Nak, No FC update | Malformed TLP | F
Pkt_length field < actual pkt length | S | Truncate pkt (or discard if S&F). | Malformed TLP | F
Pkt_length field > actual pkt length | S | Stop at END. Discard if S&F. No Nak, No FC update | Malformed TLP | F
Unexpected Completion (2) | S | Discard completion | Unexpected Completion | F
Request violates programming model of Rx device | N | Discard request | Completer Abort | NF
Rx device unable to process request due to device-specific error condition | N | Discard request | Completer Abort | NF
Advertising more than 2048 credits for payload and 128 for header | | | Flow Control Protocol Error | F
Did not advertise FC credit values >= min defined in Table 2-27 of PCI Express spec | | | Flow Control Protocol Error | F
Non-zero credit values received after infinite credit advertised | | | Flow Control Protocol Error | F
Link DL_Down state detected | | Return completion as Unsupported Request, discard request | Unsupported Request | F
Received poisoned TLP | S | Pass TLP thru, unless directed to switch, then discard (and return UR for non-posted requests) | Poisoned TLP Received | NF

(1) See conditions for malformed TLPs under the Malformed TLP section.
(2) Nexsis Switch will never expect completions, so can just ignore them and let them pass
thru switch.

Table 2.5-8 Rx Transaction Layer Errors
2.5.5.1.5 Tx Data Link Layer Errors

Error | S/N | Packet Behavior | Reported Error | Type
------|-----|-----------------|----------------|-----
REPLAY_NUM rolls over | S | Retrain the link | REPLAY_NUM Rollover | C
REPLAY_TIMER expires | S | Retry entire buffer | Replay Timeout | C

Table 2.5-9 Tx Data Link Layer Errors
2.5.5.1.6 Tx Transaction Layer Errors

Error | Opt | Packet Behavior | Reported Error | Type
------|-----|-----------------|----------------|-----
Invalid TC, OSD or type | | Nak the transaction coming from the Switch Core. | |
TLP length exceeds Max Payload Size | | Discard TLP (for TLPs with data payload only) | |
Actual packet length is greater than pkt length field | | Truncate and nullify packet | Malformed TLP |
Actual packet length is less than pkt length field | | Nullify packet | Malformed TLP |
Completion Timeout (1) | | | Completion Timeout | NF

(1) The Nexsis Switch will not master any Requests requiring Completions, so no timeout should
ever be enabled.

Table 2.5-10 Tx Transaction Layer Errors



2.6 Quality of Service

There are two basic components to Quality of Service in our switch. First, the Traffic
Class (TC) field in the transaction allows the driver to differentiate certain transaction
flows and have them mapped into Virtual Channels (VCs). So a particular message
might be labeled with a TC of 7 for high-priority, while standard memory reads and
writes might be labeled with a TC of 0. Our switch supports all 8 VCs. Second, our
switch allows different OS Domains to share an I/O port, and each OSD will
automatically receive its fair share of the bandwidth when that port is congested.



2.6.1 TC/VC Mapping

The PCI Express spec lists a 3-bit Traffic Class (TC) field that is present in the
transaction header. This field is used to differentiate traffic so that it can be prioritized,
queued, and scheduled independently inside the various PCI Express devices. Although
the spec says that anything other than TC0 is optional, our switch will accommodate all 8
traffic classes.

TCs are mapped onto Virtual Channels (VCs) in a vendor-defined manner according to
the PCI Express spec. For our switch, each incoming TLP will have its TC field mapped
into a VC based on software configuration settings. TC0 must always be mapped to
VC0, but all other TCs may be mapped to VCs without restriction individually on a
per-port basis.
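A per-port mapping table that enforces the TC0-to-VC0 rule might look like the following sketch. The class and method names are hypothetical, and the default of mapping every TC to VC0 is an assumption made only so the table starts in a legal state.

```python
NUM_TCS = 8   # 3-bit Traffic Class field
NUM_VCS = 8   # the switch supports all 8 VCs


class TcVcMap:
    """Hypothetical software view of one port's TC-to-VC mapping table."""

    def __init__(self) -> None:
        # Assumed reset state: every TC maps to VC0.
        self._table = [0] * NUM_TCS

    def set_mapping(self, tc: int, vc: int) -> None:
        if not 0 <= tc < NUM_TCS or not 0 <= vc < NUM_VCS:
            raise ValueError("TC and VC must be in the range 0-7")
        if tc == 0 and vc != 0:
            raise ValueError("TC0 must always be mapped to VC0")
        self._table[tc] = vc

    def vc_for(self, tc: int) -> int:
        """Return the VC that an incoming TLP with this TC is placed on."""
        return self._table[tc]
```

Each port would hold its own `TcVcMap`, so that, for instance, TC7 can map to VC3 on one port and to VC1 on another.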



Each RX MAC will have a TC mapping table that's just a flat-mapped lookup table that
turns an OSD and TC into a Source QID. The Source QID is needed to know which of
the 16 flow control resources were charged with the credits of the incoming TLP.

Once the destination port and destination QID are returned from the
address_lookup_module, the transaction can actually be queued now that all the data
for the transaction is known.




2.6.2 Arbitration points 

There are three arbitration points inside our switch that are supported, two in the
transaction scheduler and one in the data mover. In the transaction_scheduler, there are
17 sets of arbiters that run independently, one per output port. Each set of arbiters
will ensure each input port is allowed a fair share of the output port's bandwidth, and
each OSD is allowed its fair share as well. All three levels of arbitration will be
discussed in the next few sections.



2.6.2.1.1 Port arbitration (transaction_scheduler)

Port arbitration is the first level of arbitration within each port_arbiter at an output port
(shown as ARB1 in the diagram on the next page). This level of arbitration will use a simple RR
scheme to make sure each input port is serviced and the bandwidths are balanced. This RR
scheme is fixed in hardware, no WRR is supported.

2.6.2.1.2 OSD/VC arbitration (transaction_scheduler)

Since there are multiple output buffer groups for each set of arbiters, a second level of RR
arbitration (ARB2) is used to pick which transaction will be selected as the next one to be
transmitted on a given output port. This RR scheme is fixed in hardware, no WRR is
supported.
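The fixed round-robin (RR) scheme used at the first two arbitration levels can be illustrated with a minimal arbiter. This is a generic RR sketch under stated assumptions, not the scheduler RTL: the class name and the request-vector interface are hypothetical.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Plain (unweighted) round-robin arbiter over a fixed number of requesters."""

    def __init__(self, num_requesters: int) -> None:
        self.num_requesters = num_requesters
        # Start so that requester 0 has the highest priority on the first grant.
        self.last_grant = num_requesters - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Grant the first active requester after the previous winner."""
        for offset in range(1, self.num_requesters + 1):
            idx = (self.last_grant + offset) % self.num_requesters
            if requests[idx]:
                self.last_grant = idx
                return idx
        return None  # nobody is requesting
```

With four requesters and only 0 and 2 active, successive grants alternate 0, 2, 0, so neither can starve the other. The same structure would apply whether the requesters are input ports (ARB1) or OSD/VC buffer groups (ARB2).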

2.6.2.1.3 Input arbitration (data_mover)

The data_mover is the final stage (ARB3) that selects which input port is actually 
allowed to transfer data to each output port. Another level of RR is run to make sure 
each input port is serviced when more than one input port has data to send to a given 
output. 



Note that the data_mover will skip over the "best" choice from the transaction_scheduler
if a different input port that is idle is able to move its data instead. In this case, a skip
flag will be set such that the data_mover will not continue skipping the "best" choice
more than once. Without this skip flag, one output could get very unlucky and get
skipped over and over, stalling a transaction potentially indefinitely under certain traffic
loads.
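The skip-flag behavior can be sketched for a single output port as follows. This is an illustrative model only: the class name, the `select` interface, and the representation of idle input ports as a set are assumptions, not the data_mover design.

```python
from typing import Iterable, Optional, Set


class DataMoverOutput:
    """Hypothetical model of the skip-once rule for one output port."""

    def __init__(self) -> None:
        self.skip_flag = False   # set after the "best" choice has been skipped once

    def select(self, best: int, others: Iterable[int],
               idle_inputs: Set[int]) -> Optional[int]:
        """Pick the input port that moves data to this output next.

        best        -- input port of the scheduler's preferred transaction
        others      -- other input ports with data for this output, in RR order
        idle_inputs -- input ports that are currently free to transfer
        """
        if best in idle_inputs:
            self.skip_flag = False
            return best
        if not self.skip_flag:
            for port in others:
                if port != best and port in idle_inputs:
                    self.skip_flag = True   # the best choice may be skipped only once
                    return port
        return None  # wait for the best choice's input port to become idle
```

If input 0 is the best choice but busy, an idle input 1 is served once; on the next decision the output waits for input 0 rather than skipping it again.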

The following picture shows these levels of arbitration. 



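[Block diagram: transaction_scheduler with per-output-port arbiters (output_port_0 shown,
with OSD0/VC0 as group 0), fed by the input ports (input port 0 shown), followed by the
data_mover.]

Figure 2.6-1: Transaction Scheduler Block Diagram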
2.7 System Level Management Examples

2.7.2 Hot Plug Add of I/O Device

Our device will integrate a hot-plug controller, allowing system software and/or chassis
management software access to several registers to control hot-plug functions. These
registers are implemented in the PCI-Express capabilities structure and can be used to
turn on and off LEDs, exchange information with the Power Controller in the chassis, and
report the status of the various slots in our system.

Once a link has been trained, OSDs have been negotiated (if necessary), and
initialization is complete, the hot plug process can begin.

For hot insertion, this process will start with the user pressing the Attention Button, which
will set a register bit in our switch for those P2P bridge headers. The COP will send an
MSI up to the hot plug services software running on each OSD that shares the
newly-plugged-in I/O device. The hot plug services software will then begin device
discovery.




2.7.2.1.1 Hot-plug messages

There are 7 different hot-plug messages defined in the PCI Express spec. Every message of this
type will be routed to the COP for processing (address_lookup_module will match the
bridge header space) if the message is addressed to one of the P2P bridge headers in our
switch. These are only required if the LEDs are implemented on the downstream I/O
device. If the LEDs are driven directly by the switch (using GPIO), these messages won't be
used.

2.7.2.1.2 Attention_Button_Pressed

The Attention Button Pressed register bit must be set to a 1 if our switch receives this
message from a downstream port. If our switch detects the attention button was pressed
for a given slot (one for each port in our switch), our switch will generate an
MSI and send it up to the hot plug services software.

2.7.2.1.3 Attention_Indicator_On, Attention_Indicator_Blink,
Attention_Indicator_Off

These three messages will be generated by the COP and sent downstream depending on
the state of the attention_indicator bit.

2.7.2.1.4 Power_Indicator_On, Power_Indicator_Blink, Power_Indicator_Off

These three messages will be generated by the COP and sent downstream depending on
the state of the power_indicator bit.
[Diagram: logical bus view showing Blade 0 through Blade 7.]

Figure 2.7-2: Logical Bus View